Information Acquisition and Exploitation in Multichannel Wireless Networks

Sudipto Guha* Kamesh Munagala^^ Saswati Sarkar* 

April 10, 2008 


Abstract



A wireless system with multiple channels is considered, where each channel has several transmission 
states. A user learns about the instantaneous state of an available channel by transmitting a control packet 
in it. Since probing all channels consumes significant energy and time, a user needs to determine what 
and how much information it needs to acquire about the instantaneous states of the available channels so 
that it can maximize its transmission rate. This motivates the study of the trade-off between the cost of 
information acquisition and its value towards improving the transmission rate.

A simple model is presented for studying this information acquisition and exploitation trade-off
when the channels are multi-state, with different distributions and information acquisition costs. The
objective is to maximize a utility function which depends on both the cost and value of information.
Solution techniques are presented for computing near-optimal policies with succinct representation in
polynomial time. These policies provably achieve at least a fixed constant factor of the optimal utility
on any problem instance, and in addition, have natural characterizations. The techniques are based on
exploiting the structure of the optimal policy, and use of Lagrangean relaxations which simplify the
space of approximately optimal solutions.

1 Introduction

Future wireless networks will provide each terminal access to a large number of channels. A channel can,
for example, be a frequency in a frequency division multiple access (FDMA) network, a code in a code
division multiple access (CDMA) network, or an antenna or a polarization state (vertical or horizontal) of
an antenna in a device with multiple antennas (MIMO). Several existing wireless technologies, e.g., IEEE
802.11a [1], IEEE 802.11b [15], and IEEE 802.11h [2], propose to use multiple frequencies. For example, the IEEE
802.11a protocol has 8 channels for indoor use and 4 channels for outdoor use in the 5 GHz band, while
the IEEE 802.11b protocol has 3 channels in the 2.4 GHz band. The potential deregulation of the wireless
spectrum is likely to enable the use of a significantly larger number of frequencies. Due to significant
advances in device technology, laptops with multiple antennas (antenna arrays) incorporated in the front lid,
and devices with smart antennas have already been developed, and the number of such antennas is likely
to significantly increase in the near future.

The increase in the number of channels is expected to significantly enhance network capacity and enable
several new bandwidth-intensive applications as multiple transmissions can now proceed simultaneously in 
a vicinity using different channels. Furthermore, the availability of multiple channels substantially enhances 
the probability (at any given time) of existence of at least one channel with acceptable transmission quality, 
since the transmission quality of the individual channels varies stochastically with time and location of the
users. These benefits can however be realized only if the users can select the channels efficiently. 

Most of the existing channel selection strategies assume complete knowledge of instantaneous transmis- 
sion qualities of all channels. We refer to this approach as "complete information based optimal control". 
Note that a user can only learn the instantaneous state of a channel by transmitting a control packet in it and 
subsequently the receiver informs the sender about the quality of the channel in a response packet (e.g., the 
RTS and CTS packet exchange in IEEE 802.11). The exchange of control packets in this probing process 
consumes additional energy, and prevents other neighboring users from simultaneously utilizing the channel. 
Probing a channel is therefore associated with a cost. When the number of available channels is large, the 
cost incurred in learning the instantaneous transmission qualities of all channels may become prohibitive. 
Owing to this cost, some recent papers have investigated selection strategies that assume no knowledge of 
instantaneous transmission qualities of any channel [27]. This approach, which we refer to as "minimal 
information based optimal control", may however attain significantly lower transmission rates owing to sub- 
optimal selection of channels. We seek to design a framework that attains any desired trade-off between the 
above extremes using only simple control mechanisms. Specifically, we develop a framework for partial 
information based stochastic control which, in accordance with the costs and the benefits of probing dif- 
ferent channels, determines both (a) the amount of information a user must obtain about the instantaneous 
transmission qualities of the channels at its disposal and also (b) how to select the channels based on the 
acquired information. 

We consider a single sender with access to n channels. Every time the sender probes a channel it 
learns about the signal to noise ratio and thereby the probability of success in the channel, but also incurs a 
certain cost which may again be different for different channels. Before each transmission, the sender needs 
to determine how many and which channels it will probe and also the sequence in which these channels 
will be probed (probing policy). In this paper, we consider the scenario where a sender can transmit in
only one channel in a time slot and transmits at most one packet in each slot. Based on the outcomes of
the probes, a sender decides whether to transmit or defer transmission until transmission qualities improve
(transmission policy). If the sender decides to transmit, it must select one of the available channels (channel
selection policy), which need not be those that it has probed. We seek to design a jointly optimal probing and 
channel selection policy that maximizes a system utility which is the difference between the probability of 
successful transmission and a suitably scaled expected probing cost before each transmission. Loosely, this 
utility function represents the gain or the profit of the sender if the sender receives credit from the receiver 
for each packet it delivers successfully and needs to additionally compensate the wireless provider for each 
probe packet it transmits. 

Technical Hurdles and Contributions: The optimal probing policy needs to probe adaptively, i.e., the 
result of a probe determines the channels to be probed subsequently. For example, consider channels with 3 
possible states (0,1, 2), each of which is associated with a different transmission quality. Clearly, the probing 
terminates if a probed channel is in the highest state. Now, let a probed channel be in the intermediate state 
(state 1). Then the subsequent probes should be limited to channels that have high probabilities of being in 
the highest state. However, if all channels that have been probed in a slot are in the lowest state, then the 
channels that have high probabilities of being in the intermediate state may also be subsequently probed. 
Even the decision regarding which channel should be probed first depends on the high order statistics of 
all the channels in a complex fashion. This is because the optimum policy may not probe a channel if
its quality has a low variance as probing it does not provide significant information but incurs additional
cost. Example 5.1 in Section 5 illustrates these points. Also, the channel selection decision depends on the
outcomes of the probes and the expectation and uncertainty of the transmission quality of the channels that
have not been probed. The optimal policy is therefore a decision tree over n variables (Figure 1) and can
be computed by solving a dynamic program with $\Omega(K 2^n)$ states; naive computations will require both
exponential time and exponential storage space. The above observations rule out greedy strategies for 
computing the optimal solution. Our main contribution is in showing that despite these hurdles, there is a 
nice (albeit non-trivial) combinatorial structure in the optimal decision trees. We then use this structure to 
design simple, natural, and polynomial time greedy algorithms which provably achieve at least 4/5 of the 
optimal utility on any problem instance. 

A nice feature of our algorithmic framework is that it easily extends to handle other constraints on the 
problem, as we elucidate next. For example, when the sender is not saturated, that is it does not always have 
packets to transmit, it need not transmit packets in every slot, and therefore needs to jointly optimize the 
transmission, probing and channel selection policy so as to attain the maximum utility. This presents addi- 
tional technical challenges since the transmission policy needs to take into account two conflicting criteria: 
the transmissions should only happen when some channel is observed to be in a very high quality state, but 
on the other hand, they should happen frequently enough to maintain stability. This leads to a rate constraint 
in the corresponding optimization. We show a novel technique based on linear programming duality to 
handle the rate constraint, and present simple polynomial time computable greedy policies which provably 
approximate the optimal policies. In addition, these greedy policies are the most natural threshold policies, 
where the decision is to transmit only when the reward from transmission exceeds a certain threshold. This 
threshold depends on the arrival rate, and is computed by a simple parametric search. 

Summary of Contributions: In summary, our main contribution is to obtain succinct polynomial time 
computable joint probing, selection and transmission policies that provably attain utility values which are 
within constant factors of the optimal utilities. More specifically: 

1. We first consider the case that a sender is saturated and seek to determine the policy that has the maximum
utility. We prove that when each channel has two states an optimal policy can be computed in
O(n log n) time (Section 4). When each channel has K states, we obtain a policy that provably attains
4/5 of the maximum utility and can be computed in $O(n^2 K)$ time (Section 5). These performance
guarantees hold even when different channels have different distributions for the state processes and
different probing costs. In the special case in which all channels have equal probing costs, but potentially
different distributions for the state processes, we present a parametrized probing and channel
selection policy whose parameters can be appropriately selected so as to attain any desired trade-off
between performance guarantee and computation time (Section 6).

2. We next consider the unsaturated sender scenario where packets arrive with a given rate $\lambda$, so that the
sender will not have packets to transmit in every slot (Section 7). The goal now is to determine the
policy that attains the maximum utility among all stable policies. We prove that when each channel
has two states such a policy can be computed in $O(n^2)$ time, and in the case where each channel has K
states, we show that a stable policy that provably attains 2/3 of the maximum utility among all stable
policies can be computed in $O(n^2 K(n + K))$ time.

All policies can be readily implemented in resource constrained devices, as once computed they can be
executed in O(n) time and stored in O(n + K) space.

Our results are somewhat surprising given that optimal solutions for most partial information based 
control problems are possibly computationally intractable, and standard approximation techniques either 
do not provide guaranteed approximation ratios or require exponential computation times [5]. Our proofs
therefore rely on exploitation of specific system characteristics and employ techniques that are not standard
in the context of stochastic control. The techniques we develop are very natural and general; they are expected
to have wider applicability in designing simple and intuitive heuristics for a larger class of problems in 
the broad area of partial information based control problems, and in particular the joint optimization of the 
reward obtained from informed selections and the cost incurred in acquiring the required information. We 
will explore this in future work. 

2 Related Literature 

We first discuss the relation of our problem with some classical problems like the stopping time and multi- 
armed bandit problems. The most well-researched version of the stopping time problem is a stochastic 
control problem that optimally selects between two possible actions at any given time: to continue or to 
stop [10]. Recently, the results for this problem have been used to solve partial information based control 
problems for statistically identical channels with equal probing costs [19, 24]. Empirical investigations
indicate that different channels available to a sender oftentimes have different statistics [16]. When channels
have different statistics and/or different probing costs, which is the case we consider in the paper, the optimal 
action needs to be selected from multiple options at any given time - the options being (a) whether to continue 
probing (b) which channel to probe next if the decision is to probe and (c) which channel to transmit if the 
decision is to stop probing. Thus, the results from the above version of stopping time problem do not apply 
in our context. The optimal stopping time problem has also been considered in a more general setting where 
the number of available actions may be more than two; our problem is in fact a special case of this general 
version (Chapter IV, O). In this general case, the process terminates in certain states, which constitute the 
termination set, and selects the optimal action in other states. But, so far, only certain broad characterizations 
of the termination set are known in this general case, and the optimal actions when the decision is not to 
stop are also not known in closed form [5]. Thus, these general results do not lead to the optimal policies
we are seeking to characterize. 

The stochastic multi-armed bandit problem considers a bandit with n arms [14]. The system can try
one arm in each slot, and when it tries an arm, it receives a random reward which depends on the state of 
the arm. The state of an arm changes only when the system tries it. The reward of a system in T slots is 
the sum of the rewards in each slot. The goal is to maximize the expected reward in T slots. Our problem 
differs from the above in that (a) the state of a channel can change even when it is not probed or used for 
transmission and (b) a node can learn the states of multiple channels in an epoch while incurring additional 
probing costs for learning the state of each additional channel and it can choose each such channel adaptively 
after observing the states of the channels chosen for probing before in the same slot. The adversarial multi-armed
bandit problem [3] and the restless bandit problem [6, 28] remove one of the above differences in
that they allow the state of an arm to change even when the system does not try it. But, the adversarial
multi-armed bandit problem [3] seeks to optimize the selection under the assumption that the sender uses
the same arm in all slots. Note that we allow a sender to probe, and also transmit, in different channels in
different slots. In another version of the adversarial multi-armed bandit problem, the goal is to select the
arms so as to minimize the "regret" or the difference in expected reward with the best policy in a collection
of a certain number of given policies [3]. Our problem differs from this version of the adversarial multi-armed
bandit problem and also from the restless bandit problem [6, 28] in that we allow a node to adaptively
probe different channels in the same slot by paying additional costs (difference (b) above). Thus the results
available in this context do not apply in our problem, and we use a different solution approach and obtain
different performance guarantees. 

Optimizing the order of evaluation of random variables so as to minimize the cost of evaluation ("pipelined
filters") has been investigated in several different contexts like diagnostic tests in fault detection and medical
diagnosis, optimizing conjunctive query and join ordering in data-stream systems, and web services Bl IISl [11,
13, 17, 20, 21, 22, 23]. However, our work is different from all the above in that we (a) consider multi-state
channel models whereas pipelined filters consider two-state models and (b) allow a node to transmit in a channel
even if the channel has not been probed. Note that usually two-state models cannot capture the statistical
variations of wireless channels [16]. As we demonstrate later, both the above generalizations significantly
alter the decision issues and the optimal solutions (Section 5).

Finally, opportunistic selection of channels with complete knowledge of channel states has been comprehensively
investigated over the last decade (e.g., [26]). But, in general, the area of partial information
based control problems, and in particular the joint optimization of the reward obtained from informed selections
and the cost incurred in acquiring the required information, remains largely unexplored in wireless
networks. Policies with provable performance guarantees are known only in special cases like statistically
identical channels [8, 19, 24], and even under these simplifying assumptions only the saturated sender case
had been investigated. We consider both the saturated and unsaturated sender case, and in both cases obtain
provable performance guarantees even when channels have different statistics and/or probing costs. The results
in this paper therefore enhance the state of knowledge in an emerging area which has hitherto received
only limited attention.

3 System Model and Problem Definition 

A sender U has access to n channels which are denoted as channels 1, 2, ..., n, each of which has K possible
states, 0, ..., K-1. We assume that time is slotted. In any slot, channel j is in state i with probability $p_{ij}$,
independent of its state in other time slots and the states of other channels in any slot. Without loss of
generality, we assume that $p_{K-1,j} < 1$ for each j, as otherwise the optimum policy is simply to transmit in
j without probing any channel. In every slot, U transmits at most one data packet in a selected channel. If
the channel selected for transmission is in state i, the transmission is successful with probability $r_i$. Without
loss of generality we assume $0 \le r_0 < r_1 < \cdots < r_{K-1}$. For simplicity, we also assume that $r_0 = 0$; all
analytical results can however be generalized to the scenario where $r_0 > 0$. Whenever U probes a channel
j, it pays a cost of $c_j > 0$. Probing different channels may incur different costs as the probing process for
different channels may interfere with the channel access of different numbers of users (based on geometry
and allocation of channels). We now formally define the policies and the performance metrics.

Definition 3.1. A probing policy is a rule that, given the set of channels the sender has already probed in 
a slot (which would be empty at the beginning of the slot) and the states of the channels probed in the slot, 
determines (a) whether the sender should probe additional channels and (b) if the sender probes additional 
channels which channel it should probe next. The sender knows the state of a channel in a slot if and only if 
it probes the channel in the slot. 

Definition 3.2. A selection policy is a rule that selects a channel for the transmission of a data packet in 
a slot on the basis of the states of the probed channels, after the completion of the probing process in the 
slot. The selection policy can select a channel even if it has not been probed in the slot, and in that case, the 
channel is referred to as a backup channel. 

Definition 3.3. The probing cost is the sum of the costs of all channels probed in the slot. The probing 
cost is clearly a random variable that depends on the probing policy and the outcomes of the probes (as the 
sender may probe subsequent channels depending on the outcomes of the previous probes). The expected 
probing cost is the expectation of this random variable and depends on both the probing policy and the 
channel statistics. 

Definition 3.4. In any slot, the transmission reward is 1 if there is a successful transmission and 0 otherwise.
Therefore, the expected transmission reward is $r_i$ in a slot t if U transmits in a channel in state i
during t. The expected transmission reward of a policy is therefore $\sum_i q_i r_i$, where $q_i$ is the probability that
the selection policy decides to use a channel which is in state i; $q_i$ depends on the channel statistics as well
as the policy.

Definition 3.5. The expected utility of the sender, denoted simply as gain, is the difference between the
expected transmission reward, and the expected probing cost scaled by a factor $\kappa$. We denote the gain of a policy $\pi$
as $G_\pi$.

The gain depends on the probing and selection policies, the channel statistics and the scaling parameter
$\kappa$: choosing $\kappa$ to be 0 makes the policy acquire complete information, while setting
it to $\infty$ makes the policy acquire no information. Since $\kappa$ can be included in the probing costs themselves,
we drop this parameter in the remaining discussion without loss of generality.

In Sections 4, 5 and 6 we assume that U's queue is never empty (saturated sender assumption); we relax
this assumption in Section 7. The two versions of the problem are defined as follows:

Saturated Sender Problem: Under the saturated sender assumption, at least one policy that maximizes 
the utility transmits a packet in every slot. We therefore assume that U transmits exactly one data packet in 
every slot. The problem formulation for the saturated sender case follows. 

Problem 1. Given $\{c_j\}$, $\{r_i\}$ and $\{p_{ij}\}$, find a probing and selection policy so as to maximize the expected
gain. Let Opt denote the optimal policy and $G_{\text{Opt}}$ its gain.

Since channels are temporally independent, the optimal policy in a slot need not depend on the decisions 
and the observations in other slots. Also, the optimal policy remains the same in all slots, though the specific 
choices it makes may be different in different slots depending on the outcome of the probes. Note that the 
optimal probing policy does not probe any further in a slot if a probed channel is in state K-1. Using
these observations, the optimal policy can be computed using a bottom-up dynamic program whose states
correspond to the tuple consisting of (a) the best state encountered so far and (b) the
set of channels that have not been probed yet. Thus, the dynamic program has $K 2^n$ states, and hence, naive
computations will require $\Omega(K 2^n)$ time and space.
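
The dynamic program described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `optimal_gain` and the argument layout (`p[i][j]` for the probability that channel j is in state i, `r[i]` for the success probability in state i, `c[j]` for the probing cost) are hypothetical. The sketch memoizes over the pair (best state seen so far, bitmask of unprobed channels), so it is exponential in n and only suitable for checking small instances.

```python
from functools import lru_cache

def optimal_gain(p, r, c):
    """Exhaustive dynamic program for the saturated sender problem.

    p[i][j]: probability that channel j is in state i
    r[i]:    success probability in state i (increasing in i, r[0] = 0)
    c[j]:    probing cost of channel j
    """
    K, n = len(r), len(c)
    # Expected reward of transmitting on channel j without probing it.
    mean = [sum(p[i][j] * r[i] for i in range(K)) for j in range(n)]

    @lru_cache(maxsize=None)
    def value(best, unprobed):
        # best: highest state observed so far in this slot (-1 if none)
        # unprobed: bitmask of channels that have not been probed yet
        stop = r[best] if best >= 0 else 0.0
        for j in range(n):
            if (unprobed >> j) & 1:
                stop = max(stop, mean[j])  # unprobed channel used as backup
        if best == K - 1:
            return stop  # no reason to probe further
        result = stop
        for j in range(n):
            if (unprobed >> j) & 1:
                rest = unprobed & ~(1 << j)
                cont = -c[j] + sum(
                    p[i][j] * value(max(best, i), rest) for i in range(K)
                )
                result = max(result, cont)
        return result

    return value(-1, (1 << n) - 1)
```

For two identical two-state channels with success probability 0.5 in state 1 and probing cost 0.1, the sketch returns 0.65: probe one channel and keep the other as backup.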

Policies and Decision Trees: Every joint policy can be represented by a unique decision tree (Figure 1);
we therefore use policies and decision trees interchangeably. 

Unsaturated Sender Problem: We now relax the assumption that a sender always has packets to transmit.
A sender generates packets as per an arrival process which constitutes a positive recurrent, aperiodic,
irreducible Markov chain. Under the steady state distribution of the Markov chain, the expected number of
arrivals in any slot is $\lambda$, where $\lambda \in [0, 1)$. Packets are stored in an infinite buffer. If in a slot the sender has a
packet in its queue, the slot is referred to as a busy slot. The sender transmits only in busy slots, but may not
transmit in every busy slot; it may improve its gain by deferring transmission until at least one channel has
good quality. The transmission policy is a rule that determines in which slots a sender transmits. The decisions
may depend on the outcomes of the probes, the queue lengths, and the channel and arrival statistics. The sender
must however ensure that it transmits at least at the rate at which it generates packets, otherwise its delay
becomes unbounded. In addition to gain, system stability is therefore of interest.

Definition 3.6. The system is stable if the sender's expected queue length is finite. A policy that attains finite 
expected queue length is a stable policy. 

Problem 2. Given $\{c_j\}$, $\{r_i\}$ and $\{p_{ij}\}$, find a probing, selection and transmission policy that stabilizes the
system and maximizes the gain among all stable policies. Let OptUnSat denote the optimal policy and
$G_u$ its gain.



4 The Two-State, Saturated Sender Problem 

We now consider the case that the sender always has packets to transmit and seek to solve Problem 1
formulated in Section 3 when K = 2. We first consider a specific class of policies, Exhaust, and prove
that Opt belongs to this class. Subsequently we show how to find the optimal policy TwoStateOpt in this
class in O(n log n) time. Thus, when K = 2, TwoStateOpt is Opt and can be computed in O(n log n)
time. Also, TwoStateOpt can be executed and stored in O(n) time and space.

Definition 4.1. Given $S \subseteq \{1, \dots, n\}$ and $i \notin S$, let Exhaust(S, i) denote the class of policies which probe
the channels in S in a deterministic order until a probed channel is in state 1 or all channels in S have been
probed. Such a policy selects the last probed channel if it is in state 1, and selects i otherwise. Channel i is denoted as
the backup channel.

In what follows, we prove that there is an optimal policy which is of the form Exhaust(S, i). We
note that results proved later in the paper (for the case K > 2) will subsume the proof of this fact, but a
straightforward application of these later results will yield an algorithm which requires $O(n^2)$ time.

Lemma 4.1. There exists an Exhaust(S, i) policy which is optimal.

Proof. We prove the lemma by induction on the number of channels n. For the base case, n = 1, the
expected gain is $r_1 p_{11} - c_1$ if the optimal policy probes the channel, and $r_1 p_{11}$ otherwise. Since $c_1 > 0$,
the policy that selects the channel without probing is optimal over all possible convex combinations, i.e.,
randomizations, of the above two policies. Thus, Exhaust($\emptyset$, 1) is an optimal policy in this case.

Assuming the lemma holds for n = s, consider a set J of s + 1 channels. Opt can either (a) select
a channel without probing or (b) probe a channel. Conditioned on case (a), $G_{\text{Opt}} = G_{\text{Exhaust}(\emptyset, j)}$, where
$j = \arg\max_j p_{1j}$. In case (b), Opt chooses to probe a channel i with some probability. Subsequently,
if i is in state 1, Opt selects i. Now, if i is in state 0, then Opt takes the same decisions as that in a
system with the s remaining channels, and by the induction hypothesis, uses an Exhaust(Q, j) policy for
some $Q \subseteq J \setminus \{i\}$, $j \in (J \setminus Q) \setminus \{i\}$. Thus, in this case, Opt is an Exhaust($\{i\} \circ Q$, j) policy where
$\circ$ denotes the ordering. Therefore, conditioned on case (b), $G_{\text{Opt}}$ is a convex combination of the gains
of Exhaust policies. Therefore, overall, $G_{\text{Opt}}$ is a convex combination of the gains of Exhaust policies.
Thus, there exists an optimum policy which is Exhaust(S, i). □

We next prove that Opt satisfies additional properties, which allows a polynomial time computation of 
Opt. 

Lemma 4.2. Let $S_i = \{j : (1 - p_{1i}) p_{1j} r_1 > c_j\}$. If Exhaust(S, i) is an optimum policy, then the following
conditions hold.

1. Channels j in S are probed in decreasing order of $p_{1j}/c_j$.

2. The Exhaust($S_i$, i) policy is an optimum policy.

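Proof. Let the Exhaust(S, i) policy be an optimum policy. Without loss of generality, $S = \{k_1, \dots, k_{|S|}\}$, where channel $k_l$ is
probed before $k_{l+1}$. Then the gain of the Exhaust(S, i) policy is

$$A = \sum_{l=1}^{|S|} (p_{1k_l} r_1 - c_{k_l}) \prod_{m=1}^{l-1} (1 - p_{1k_m}) + p_{1i} r_1 \prod_{m=1}^{|S|} (1 - p_{1k_m}).$$

We first prove (1). Recall that $p_{1j} < 1$ for all channels j. Let $p_{1k_s}/c_{k_s} < p_{1k_{s+1}}/c_{k_{s+1}}$ for some s. Consider a new
policy which probes $k_{s+1}$ before $k_s$ but is otherwise similar to the Exhaust(S, i) policy. Let the gain of
this new policy be B. Then,

$$A - B = \prod_{m=1}^{s-1} (1 - p_{1k_m}) \, (p_{1k_s} c_{k_{s+1}} - p_{1k_{s+1}} c_{k_s}) < 0.$$

Thus, clearly, B > A. Thus, Exhaust(S, i) with this probing order is not the optimum policy. The result follows.

We now prove (2). If $S = S_i$ the result follows. Let $S \ne S_i$. Thus, either $S_i \setminus S \ne \emptyset$ or $S \setminus S_i \ne \emptyset$.

Let $S \setminus S_i \ne \emptyset$. Consider some $j \in S \setminus S_i$. From (1), $p_{1k_l}/c_{k_l} \ge p_{1k_{l+1}}/c_{k_{l+1}}$. Thus,

$$\frac{(1 - p_{1i}) p_{1k_{|S|}} r_1}{c_{k_{|S|}}} = \min_{1 \le l \le |S|} \frac{(1 - p_{1i}) p_{1k_l} r_1}{c_{k_l}} \le \frac{(1 - p_{1i}) p_{1j} r_1}{c_j} \le 1.$$

Thus, $k_{|S|} \in S \setminus S_i$. Let $Q = S \setminus \{k_{|S|}\}$. The gain of the Exhaust(Q, i) policy with probing sequence $k_1, \dots, k_{|S|-1}$ is

$$D = \sum_{l=1}^{|S|-1} (p_{1k_l} r_1 - c_{k_l}) \prod_{m=1}^{l-1} (1 - p_{1k_m}) + p_{1i} r_1 \prod_{m=1}^{|S|-1} (1 - p_{1k_m}).$$

Now, $D - A = (c_{k_{|S|}} - (1 - p_{1i}) p_{1k_{|S|}} r_1) \prod_{m=1}^{|S|-1} (1 - p_{1k_m})$. Since
$(1 - p_{1i}) p_{1k_{|S|}} r_1 \le c_{k_{|S|}}$, $D \ge A$. Thus, Exhaust(Q, i) is an optimum policy, where $Q \subset S$ and
$|Q \setminus S_i| < |S \setminus S_i|$. Continuing this argument, clearly there exists a T such that $T \subseteq S$ and $T \setminus S_i = \emptyset$ and
the Exhaust(T, i) policy is optimal.

Now let $S_i \setminus S \ne \emptyset$. If $S \setminus S_i \ne \emptyset$, let T be as constructed in the above paragraph; otherwise let
T = S. In both cases, the Exhaust(T, i) policy is optimal, so its gain equals A; let $k_1, \dots, k_{|T|}$ now denote its probing sequence. We now show that $S_i \setminus T = \emptyset$. If not, consider a
$j \in S_i \setminus T$. Let $Q = T \cup \{j\}$. The gain of the Exhaust(Q, i) policy with probing sequence $k_1, \dots, k_{|T|}, j$ is

$$C = \sum_{l=1}^{|T|} (p_{1k_l} r_1 - c_{k_l}) \prod_{m=1}^{l-1} (1 - p_{1k_m}) + (p_{1j} r_1 - c_j) \prod_{m=1}^{|T|} (1 - p_{1k_m}) + p_{1i} r_1 (1 - p_{1j}) \prod_{m=1}^{|T|} (1 - p_{1k_m}).$$

Now, $C - A = ((1 - p_{1i}) p_{1j} r_1 - c_j) \prod_{m=1}^{|T|} (1 - p_{1k_m})$. Since $p_{1s} < 1$ for all s and $(1 - p_{1i}) p_{1j} r_1 > c_j$,
C > A. This contradicts the optimality of Exhaust(T, i). Thus, $S_i \setminus T = \emptyset$. Thus, $S_i = T$. Hence, the
Exhaust($S_i$, i) policy is optimal. □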
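
As a numerical illustration of Lemma 4.2, the sketch below evaluates the gain expression of an Exhaust(S, i) policy and compares, for a fixed backup channel, the policy prescribed by the lemma against an exhaustive search over all probing sets and orders. The function names (`exhaust_gain`, `lemma_policy_gain`, `brute_force_gain`), the 0-based channel indices, and the default $r_1 = 1$ are assumptions made for this example only; costs are assumed strictly positive and $p_{1j} < 1$.

```python
from itertools import permutations

def exhaust_gain(order, backup, p1, c, r1=1.0):
    """Gain of Exhaust(S, i): probe `order` in sequence, stop at the first
    channel found in state 1, otherwise transmit on `backup` unprobed."""
    gain, all_zero = 0.0, 1.0  # all_zero = P(all channels probed so far are in state 0)
    for k in order:
        gain += all_zero * (p1[k] * r1 - c[k])
        all_zero *= 1.0 - p1[k]
    return gain + all_zero * p1[backup] * r1

def lemma_policy_gain(backup, p1, c, r1=1.0):
    """Gain of Exhaust(S_i, i) with S_i probed in decreasing order of p1[j]/c[j]."""
    s_i = [j for j in range(len(p1))
           if j != backup and (1.0 - p1[backup]) * p1[j] * r1 > c[j]]
    s_i.sort(key=lambda j: p1[j] / c[j], reverse=True)
    return exhaust_gain(s_i, backup, p1, c, r1)

def brute_force_gain(backup, p1, c, r1=1.0):
    """Best gain over every subset and every probing order (tiny n only)."""
    others = [j for j in range(len(p1)) if j != backup]
    return max(exhaust_gain(order, backup, p1, c, r1)
               for size in range(len(others) + 1)
               for order in permutations(others, size))
```

On small instances the two gains coincide for every choice of backup channel, which is what the exchange arguments in the proof predict.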



Lemmas 4.1 and 4.2 prove that there exists an Exhaust($S_i$, i) policy that is optimal, and this policy
probes the channels in $S_i$ in decreasing order of $p_{1j}/c_j$. The routine Determine Best Backup described
below determines the Best-Bkup channel $i^*$ such that Exhaust($S_{i^*}$, $i^*$) attains the maximum gain among
all such Exhaust($S_i$, i) policies. Note that $i^*$ can be computed in $O(n^2)$ time using a naive implementation,
but the following computation requires only O(n log n) time.



Determine Best Backup

1. Sort the channels in decreasing order of $p_{1i}/c_i$. Renumber the channels in accordance
with the sorted order, i.e., if i < j, then $p_{1i}/c_i \ge p_{1j}/c_j$. Let $S_i = \{j : (1 - p_{1i}) p_{1j} r_1 > c_j, j \ne i\}$.
Let Gain(i) denote the gain of Exhaust($S_i$, i).

2. Let $D_1 = 1$. For $j \ge 1$, compute $D_{j+1} = D_j (1 - p_{1j})$.
/* $D_{j+1}$ = probability that the first j channels return state "0" */

3. Let $F_1 = 0$. For $j \ge 1$, compute $F_{j+1} = F_j + (p_{1j} r_1 - c_j) D_j$.
/* $F_{j+1}$ = gain of the first j channels if probed */

4. For each channel i, if $i > |S_i|$, Gain(i) = $F_{|S_i|+1} + p_{1i} r_1 D_{|S_i|+1}$;
else Gain(i) = $F_i + \frac{F_{|S_i|+2} - F_{i+1}}{1 - p_{1i}} + \frac{p_{1i} r_1}{1 - p_{1i}} D_{|S_i|+2}$.

5. Let Best-Bkup = $\arg\max_{i = 1, \dots, n}$ Gain(i).

We now explain the computations of the gains of Exhaust policies in the Determine Best Backup
routine. The channels are numbered in decreasing order of $p_{1j}/c_j$. Now, $F_{|S_i|+1}$ is the gain of sequentially
probing the first $|S_i|$ channels, and $D_{|S_i|+1}$ is the probability that the first $|S_i|$ channels are in state 0. If
$i > |S_i|$, then the first $|S_i|$ channels constitute $S_i$. Thus, the gain of Exhaust($S_i$, i) is $F_{|S_i|+1} + p_{1i} r_1 D_{|S_i|+1}$.

If $i \le |S_i|$, the first $|S_i| + 1$ channels constitute $S_i \cup \{i\}$. Thus, $F_i + \frac{F_{|S_i|+2} - F_{i+1}}{1 - p_{1i}}$ is the gain
obtained by probing channels in $S_i$ in decreasing order of $p_{1j}/c_j$, and $D_{|S_i|+2}/(1 - p_{1i})$ is the probability
that all channels in $S_i$ are in state 0. Thus, the gain of Exhaust($S_i$, i) is
$F_i + \frac{F_{|S_i|+2} - F_{i+1}}{1 - p_{1i}} + \frac{p_{1i} r_1}{1 - p_{1i}} D_{|S_i|+2}$. The computation time for the Determine Best Backup routine is dominated by the time
required to sort the channels, which is O(n log n).
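
The routine can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the name `determine_best_backup`, the 0-based prefix arrays `D` and `F` (so `D[j]` and `F[j]` play the role of $D_{j+1}$ and $F_{j+1}$ above), and the use of binary search to find $|S_i|$ are choices made here. It assumes $c_j > 0$, $p_{1j} < 1$ and $r_1 > 0$, and ignores ties in the ratios.

```python
from bisect import bisect_left

def determine_best_backup(p1, c, r1=1.0):
    """Return (gain, backup, probe_order) for the best Exhaust(S_i, i) policy.

    p1[j]: probability that channel j is in state 1 (must be < 1)
    c[j]:  probing cost of channel j (must be > 0)
    """
    n = len(p1)
    order = sorted(range(n), key=lambda j: p1[j] / c[j], reverse=True)
    p = [p1[j] for j in order]
    cost = [c[j] for j in order]

    # D[j] = P(first j sorted channels are all in state 0)
    # F[j] = gain collected by probing the first j sorted channels in order
    D = [1.0] * (n + 1)
    F = [0.0] * (n + 1)
    for j in range(n):
        D[j + 1] = D[j] * (1.0 - p[j])
        F[j + 1] = F[j] + (p[j] * r1 - cost[j]) * D[j]

    neg_ratio = [-p[j] / cost[j] for j in range(n)]  # ascending, for bisect
    best = None
    for i in range(n):
        # Channels with p_j / c_j > 1 / ((1 - p_i) r1) form a prefix of length m.
        m = bisect_left(neg_ratio, -1.0 / ((1.0 - p[i]) * r1))
        if i >= m:
            # Backup lies outside the prefix, so S_i is the whole prefix.
            gain = F[m] + p[i] * r1 * D[m]
        else:
            # Backup lies inside the prefix: skip it and undo its (1 - p_i) factor.
            q = 1.0 - p[i]
            gain = F[i] + (F[m] - F[i + 1]) / q + p[i] * r1 * D[m] / q
        if best is None or gain > best[0]:
            best = (gain, i, m)

    gain, i, m = best
    return gain, order[i], [order[j] for j in range(m) if j != i]
```

Sorting costs O(n log n), the prefix arrays cost O(n), and each of the n binary searches costs O(log n), which matches the stated running time.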

The following TwoStateOpt policy is Exhaust($S_{i^*}$, $i^*$), where $i^*$ is the BestBkup channel returned
by the routine Determine Best Backup.



TwoStateOpt 

1. Probe channels $j \in S_{\text{BestBkup}}$ until a probed channel is in state 1 or all channels in $S_{\text{BestBkup}}$ have been
probed.

2. If the last probed channel $j$ is in state 1, transmit the packet in $j$; else transmit the packet in the
BestBkup channel.
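A single slot of this policy can be simulated with a few lines of Python. This is a minimal sketch, assuming the probing order and the backup channel have already been determined; the function name `two_state_opt` and its arguments are hypothetical.

```python
def two_state_opt(probe_order, backup, state, cost, r1=1.0):
    """Run TwoStateOpt for one slot.

    probe_order: channels of S_BestBkup in the order in which they are probed.
    backup:      the BestBkup channel (never probed).
    state[j]:    realized state (0 or 1) of channel j in this slot.
    cost[j]:     probing cost of channel j.
    Returns (channel used for transmission, reward minus total probing cost).
    """
    spent = 0.0
    for j in probe_order:
        spent += cost[j]
        if state[j] == 1:
            # Step 2, first case: the last probed channel is in state 1.
            return j, r1 - spent
    # Step 2, second case: every probed channel was in state 0.
    return backup, r1 * state[backup] - spent
```

The backup channel is used without being probed, so its reward depends on its realized (unobserved) state.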



Theorem 4.3. TwoStateOpt attains the maximum gain when $K = 2$, and can be computed in $O(n \log n)$
time.

Since TwoStateOpt is the Exhaust($S_{i^*}$, $i^*$) policy that attains the maximum gain among all Exhaust($S_i$, $i$)
policies that probe the channels in $S_i$ in decreasing order of $p_{1j}/c_j$, its optimality follows from Lemmas
4.1 and 4.2. The computation time for TwoStateOpt is the same as that for the routine Determine Best
Backup, which is $O(n \log n)$. Thus, Theorem 4.3 follows. Clearly, TwoStateOpt
can be executed and stored in $O(n)$ time and space.

5 The Multi-State Saturated Sender Problem 

We still assume that the sender always has packets to transmit but now focus on the case that each channel 
has K states where K > 2. We first demonstrate that some natural generalizations of the TwoStateOpt 
policy are suboptimal when K > 2. 

Example 5.1. Recall that TwoStateOpt probes channels in increasing order of $c_j/p_{1j}$. Thus, for $K > 2$,
the natural generalizations of TwoStateOpt are to probe channels in decreasing order of the ratio
between (a) their probabilities of being in the highest state and their costs (i.e., $p_{K-1,j}/c_j$) or (b) their expected
rewards and their costs (i.e., $\sum_{k=0}^{K-1} p_{kj} r_k / c_j$). Figure 1 presents one scenario where both these probing
sequences are sub-optimal. Note that $i$ has the least, $j$ intermediate and $k$ maximum expected rewards, and
$p_{2k}/c_k > p_{2j}/c_j > p_{2i}/c_i$. But Opt probes $i$ before probing $j$.

The main challenge for K > 2 is that the optimal probing sequence needs to be adaptively determined 
depending on the outcomes of the previous probes in a slot (Figure 1). For example, when $K = 3$, and
when a probed channel is in the intermediate state (state 1), then the subsequent probes should be limited to 
channels that have high probabilities of being in state 2. However, if all channels that have been probed in a 
slot are in state 0, then the channels that have high probabilities of being in state 1 may also be subsequently 
probed. 

We show that in $O(n^2 K)$ time, we can compute a policy which attains 4/5 of the maximum gain, for
arbitrary distributions for the state processes and costs. However, more importantly, we develop techniques 
and ideas, such as the Structure Theorem below, which are useful beyond the context of this specific problem. 
In fact, we will use the structure theorem for all the problems considered in the rest of this paper. 

Figure 1: Consider a node U that has access to 3 channels $i, j, k$, each of which has 3 states. Let
r2 = ^,ri = 0.1. The probabilities associated with different states of i,j,k are (0.49,0.02,0.49), 
$(0.5, 0.5 - \delta, \delta)$ respectively. Also, $c_i = 0.05885\delta$, $c_j = 0.06\delta$, $c_k = 0.05\delta$. Let $\delta < 0.15$.
The figure shows the decision tree for Opt in this case. A channel is probed at each probe node, and the
letter inside it indicates which channel is probed at the node. The numbers next to the branches indicate the
outcome of the probe. The number r/s next to a branch indicates that both states r and s of the previously
probed channel lead to the same action. For example, the sender first probes channel $i$. If $i$ is in state 2, it
transmits in $i$. If $i$ is in state 1 or 0, it probes $k$ or $j$, respectively.



5.1 The Roadmap and the Main Results 



The main component of the algorithm is the following theorem, which is proved in Section 5.2.



Theorem 5.1 (Structure Theorem). There exists an optimum policy which uses a unique backup channel. 

The theorem states that there exists an optimum policy Opt and a channel $\ell$ such that whenever Opt
uses a backup, it uses $\ell$ as a backup. Note that the above is true for the case $K = 2$ trivially, because if
at any point the sender observes a channel to be in state 1, there is no further benefit of probing. Thus the
strategy corresponds to a path where we observe every probed channel to be in state 0. Note that another
interesting property of the case $K = 2$ is that the backup channel is never probed. This motivates the
following definitions.

Definition 5.1. Let $\mathcal{P}(\ell)$ denote the class of policies, each of which (a) never probes $\ell$ and (b) never uses
any channel other than $\ell$ as a backup. Let $\mathcal{P}(0)$ correspond to the class of policies that never use backup
channels (note that the channels are numbered $1, 2, \ldots, n$). Let the policy that attains the maximum gain
among all policies in $\mathcal{P}(\ell)$ be denoted as ReserveBkup($\ell$), and the policy that attains the maximum gain
among all policies in $\cup_{\ell=0}^{n} \mathcal{P}(\ell)$ be denoted as BestReserveBkup. Let $G_{\text{BestReserveBkup}}$ be the gain of
BestReserveBkup.

The following theorems indicate why BestReserveBkup is of interest. 

Theorem 5.2. For $\ell = 0, 1, \ldots, n$, we can compute a policy ReserveBkup($\ell$) that attains the maximum
gain among all policies in $\mathcal{P}(\ell)$ in time $O(nK \log n)$.

ReserveBkup($\ell$) is presented in Section 5.3 and the above theorem is proved in Section 5.4.
Therefore, clearly, we can compute BestReserveBkup in time $O(n^2 K \log n)$. In Section 5.3
we argue that we can in fact compute BestReserveBkup in time $O(n^2 K)$.

Theorem 5.3. $G_{\text{BestReserveBkup}} \ge (4/5)\, G_{\text{Opt}}$.

Proof. By the Structure Theorem (Theorem 5.1), there exists an optimum policy Opt that uses a unique
backup, say $B$. Let $\alpha$ denote the probability with which Opt uses $B$ as a backup. Construct a new policy $A$
that is similar to Opt except that it probes $B$ whenever Opt uses $B$ as a backup. Clearly, $A$ attains a gain
of at least $G_{\text{Opt}} - \alpha c_B$. Also, since $A \in \mathcal{P}(0)$, its gain is at most $G_{\text{BestReserveBkup}}$. Thus,

$$G_{\text{BestReserveBkup}} \ge G_{\text{Opt}} - \alpha c_B. \qquad (1)$$

We now show that there exists a policy which never probes $B$, but does not perform significantly worse
than Opt. In this discussion, the gain $G(T)$ of a sub-tree $T$ rooted at $t$ is defined as the expected reward
owing to transmissions at the leaves of $T$ conditioned on reaching $t$, minus the expected probing cost of
nodes in $T$.

Suppose that Opt probes $B$ at nodes $m_1, \ldots, m_J$ in its decision tree. Let $\beta_1, \ldots, \beta_J$ be the respective
probabilities that Opt traverses these nodes and $G_1, \ldots, G_J$ be the respective gains of Opt given that it
traverses these nodes. Now consider the gains $G'_1, \ldots, G'_J$ of the subtrees produced by modifying the
decision tree so that $B$ is not probed. This produces a decision tree $\Gamma$ which is the same as that for Opt
except for the trees rooted at $m_1, \ldots, m_J$. $\Gamma$ reaches these nodes with the same probabilities as Opt. We
now make a claim which allows us to complete the proof, and subsequently we prove the claim.
This claim will also be used for other results.

Claim 5.4. Let $T$ be any decision (sub-)tree rooted at node $t$ where we probe channel $u$, and let its gain be
$G(T)$. Suppose at the point of arriving at decision node $t$, the best probed channel has state at least $j \ge 0$
(equivalently, reward at least $r_j$). Then there exists a corresponding (sub-)tree $T'$, where $u$ is not probed at
$t$ or anywhere else in $T'$, and whose gain $G(T')$ satisfies $G(T') \ge G(T) + c_u - \sum_{i=j+1}^{K-1} p_{iu} r_i$.

We first complete the proof of the theorem using the above Claim 5.4. Setting $j = 0$, it follows that
$G_k - G'_k \le \sum_{i=0}^{K-1} p_{iB} r_i - c_B$. Then the difference between the overall gains of Opt and $\Gamma$ is
$\sum_{k=1}^{J} \beta_k (G_k - G'_k)$. Since $\sum_{k=1}^{J} \beta_k \le 1 - \alpha$, the policy $\Gamma$, which never probes $B$, attains a gain of at least
$G_{\text{Opt}} - (1 - \alpha)\left(\sum_{i=0}^{K-1} p_{iB} r_i - c_B\right)$.

Since $\Gamma$ never probes $B$, $\Gamma \in \mathcal{P}(B)$. Thus we have

$$G_{\text{BestReserveBkup}} \ge G_{\text{Opt}} - (1 - \alpha)\left(\sum_{i=0}^{K-1} p_{iB} r_i - c_B\right). \qquad (2)$$

Multiplying Equations (1) and (2) by $(1 - \alpha)$ and $\alpha$ respectively, and adding the results, we have
$G_{\text{BestReserveBkup}} \ge G_{\text{Opt}} - \alpha(1 - \alpha) \sum_{i=0}^{K-1} p_{iB} r_i$. Now, the policy that uses $B$ as a backup without probing
any channel attains a gain of $\sum_{i=0}^{K-1} p_{iB} r_i$. Since this policy is in $\mathcal{P}(B)$, $\sum_{i=0}^{K-1} p_{iB} r_i \le G_{\text{BestReserveBkup}}$.
Thus, $G_{\text{BestReserveBkup}} \ge G_{\text{Opt}} / (1 + \alpha(1 - \alpha))$. The result follows since the maximum value of the
denominator is 1.25. This proves the theorem; we now focus on Claim 5.4.
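The combination of inequalities (1) and (2) can be written out step by step as below. This is a worked restatement of the argument rather than new material; $G_{\mathrm{BRB}}$ is shorthand (an assumption of this block) for the gain of BestReserveBkup, and the last step uses that $\alpha(1-\alpha)$ is maximized at $\alpha = 1/2$.

```latex
\begin{align*}
(1-\alpha)\,G_{\mathrm{BRB}} &\ge (1-\alpha)\,G_{\mathrm{Opt}} - \alpha(1-\alpha)\,c_B
  && \text{(1) times } (1-\alpha)\\
\alpha\,G_{\mathrm{BRB}} &\ge \alpha\,G_{\mathrm{Opt}}
  - \alpha(1-\alpha)\Big(\sum_{i=0}^{K-1} p_{iB}\,r_i - c_B\Big)
  && \text{(2) times } \alpha\\
G_{\mathrm{BRB}} &\ge G_{\mathrm{Opt}} - \alpha(1-\alpha)\sum_{i=0}^{K-1} p_{iB}\,r_i
  \;\ge\; G_{\mathrm{Opt}} - \alpha(1-\alpha)\,G_{\mathrm{BRB}}
  && \text{sum, then } \textstyle\sum_i p_{iB} r_i \le G_{\mathrm{BRB}}\\
G_{\mathrm{BRB}} &\ge \frac{G_{\mathrm{Opt}}}{1+\alpha(1-\alpha)}
  \;\ge\; \frac{G_{\mathrm{Opt}}}{1+\tfrac14} = \tfrac45\,G_{\mathrm{Opt}}
  && \alpha(1-\alpha) \le \tfrac14
\end{align*}
```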



(Proof of Claim 5.4). Let $F_{iu}$ denote the set of leaf nodes in $T$ where the decision is to transmit on probed
channel $u$ in state $i$ (note that this happens only if $i > j$). Let $W$ be the event that $t$ is reached, and $\mathcal{F}_{iu}$ be the
event that $F_{iu}$ is reached. First note that $\Pr[\mathcal{F}_{iu} \mid W] \le p_{iu}$ for all states $i$. Now construct a corresponding
decision tree $T''$ in which the decision at $F_{iu}$ is to transmit on (a) the best probed channel excluding the
probed channel $u$ if channels other than $u$ have been probed and (b) a backup channel otherwise. Clearly,

$$G(T'') \ge G(T) - \sum_{i=j+1}^{K-1} \Pr[\mathcal{F}_{iu} \mid W]\, r_i \ge G(T) - \sum_{i=j+1}^{K-1} p_{iu} r_i.$$

In the decision tree $T''$, the transmission is never on probed channel $u$. Therefore, at the root node $t$ in
$T''$ where $u$ is probed, suppose we do not actually probe $u$ (saving cost $c_u$), but simply choose the branch
corresponding to $u$ being in state $i$ with probability $p_{iu}$. This new decision tree $T^*$ has gain:

$$G(T^*) \ge G(T'') + c_u \ge G(T) - \sum_{i=j+1}^{K-1} p_{iu} r_i + c_u.$$

This new decision tree neither probes nor uses $u$. Denote the sub-tree of $T''$ below $t$ corresponding to
channel $u$ being in state $i$ by $T_i$. If $i^* = \arg\max_i G(T_i)$, we have:

$$G(T^*) = \sum_{i=0}^{K-1} p_{iu}\, G(T_i) \le \max_i G(T_i) = G(T_{i^*}).$$

Modify $T^*$ so that on reaching node $t$, branch $T_{i^*}$ is chosen. Denote this tree by $T'$. Then,

$$G(T') = G(T_{i^*}) \ge G(T^*) \ge G(T) - \sum_{i=j+1}^{K-1} p_{iu} r_i + c_u.$$

The result follows. □

Note that there are cases where BestReserveBkup is strictly suboptimal (e.g., in Figure 1, where
Opt probes the backup channel $k$ on some paths). But, in practice, the gain of BestReserveBkup
substantially exceeds the lower bound in Theorem 5.3. For example, even in Figure 1 (which is one of the few
cases where we observed the suboptimality of BestReserveBkup), the gain of ReserveBkup($k$), and
hence that of BestReserveBkup, is only 0.1% less than that of Opt.

We now point out an important property of ReserveBkup(0). Recall that $\mathcal{P}(0)$ consists of all policies
that transmit only in probed channels and never use backup channels. Thus, ReserveBkup(0), which
will henceforth be denoted as OptNoBkup, attains the maximum gain among all such policies. From
Theorem 5.2, OptNoBkup can be determined in $O(nK + n \log n)$ time. Thus, the optimum policy is
polynomial-time computable when backups are not allowed.

Finally, when $K = 2$, every policy is in the class $\cup_{u=0}^{n} \mathcal{P}(u)$, and hence Theorem 5.2 also proves that the
optimal policy can be computed in $O(n^2)$ time for $K = 2$. However, for $K = 2$ we have already obtained
an optimal policy which has a lower running time.

5.2 Proof of the Structure Theorem 

Definition 5.2. A $\le i$ tree is a decision tree which takes the same decisions irrespective of the states of the
probed channels provided the states are less than or equal to $i$. The decisions corresponding to the states
which are less than or equal to $i$ therefore constitute a path in such a tree, which we refer to as a $\le i$ path.



We prove the Structure Theorem 5.1 (in fact a slightly stronger version which we will require later)
using the following lemma.

Lemma 5.5. Suppose Opt probes a channel $j$ at a node $m$ in its decision tree and, if $j$ is in state $i$, it takes
a backup downstream. Then there exists another optimum policy Opt1 which has the same decision tree as
Opt except possibly for the tree rooted at $m$. In Opt1, the tree rooted at $m$

1. is a $\le i$ tree,

2. takes a backup, say $\ell$, at the end of its $\le i$ path, and

3. takes $\ell$ as a backup wherever it takes a backup.

Proof. We prove the lemma using induction on the states. The lemma holds by vacuity for all channels $j$ and
nodes $m$ if $i = K - 1$. This is because Opt never takes a backup after observing a channel in state $K - 1$.

Suppose the lemma holds for all channels $j$, nodes $m$ in the decision tree of Opt, and states $i + 1, \ldots, K - 1$.
We prove the lemma for all channels $j$, nodes $m$ and state $i$. Now, let Opt probe a channel $j$ at a node $m$ in
its decision tree and take a backup somewhere downstream if $j$ is in state $i$. Let $m_1$ be the first node after $j$
is probed at $m$ and observed to be in state $i$. Clearly, the decision tree rooted at $m_1$ is a $\le i$ tree, and takes
at least one backup somewhere downstream.

We will first show that there is one optimum policy Opt2 which is similar to Opt except possibly for
the tree rooted at $m_1$. The tree rooted at $m_1$ in Opt2 is still a $\le i$ tree but

(p1) takes a backup, say $\ell$, at the end of its $\le i$ path and

(p2) takes $\ell$ as a backup wherever it takes a backup.

Suppose the tree $T$ rooted at $m_1$ in Opt does not satisfy the above conditions. Then there is a path originating
from its $\le i$ path which ends in a backup. Let $m_2$ be the highest node (i.e., the node closest to $m_1$) from where
such a path originates on the $\le i$ path of $T$. Clearly this path corresponds to a channel being observed in a
state higher than $i$, say $q$, at $m_2$. From the induction hypothesis, there exists one optimum Opt2 which is
similar to Opt except possibly for the decision tree rooted at $m_2$. In Opt2 this decision tree

• is a $\le q$ tree,

• takes a backup, say $\ell$, at the end of its $\le q$ path and

• takes $\ell$ as a backup wherever it takes a backup.

Figure 2: The transformation of Opt2 to Opt1 in Lemma 5.5.

Note that Opt2 satisfies conditions (p1) and (p2) for the tree rooted at $m_1$; see Figure 2.

Let $\alpha$ be the probability that Opt2 never visits $m$, $G'$ be the expected gain of Opt2 if it never visits
$m$, and $G_h$ be the expected gain of Opt2 given that $j$ is observed in state $h$ at node $m$. Thus, the expected
gain of Opt2 is $\alpha G' + (1 - \alpha) \sum_h p_{hj} G_h$. Let $T_h$ be the decision tree in Opt2, and hence in Opt, after
$j$ is observed to be in a state $h < i$ after being probed at $m$. Consider a new policy which is similar to
Opt2 except that it replaces the decision tree rooted at $m_1$ with $T_f$ for some $f < i$. Note that the gain
of this new policy given that $j$ is observed in state $i$ at node $m$ is at least $G_f$, since the overall gain after
observing a state $i$ and a subsequent sequence of actions cannot be less than that after observing
$f$ and the same subsequent sequence of actions. Thus, the expected gain of this new policy is at least
$\alpha G' + (1 - \alpha) \sum_{h \ne i} p_{hj} G_h + (1 - \alpha)\, p_{ij} G_f$. Since the expected gain of this policy cannot exceed that of
the optimum, $G_f \le G_i$.

Now, consider another policy Opt1 which is similar to Opt2 except that it replaces the decision trees
$T_0, \ldots, T_{i-1}$ (i.e., those rooted at nodes immediately downstream of $m$ and corresponding to $j$ being in
states lower than $i$) with the decision tree rooted at $m_1$ (i.e., the one corresponding to $j$ being in state $i$);
refer to Figure 2. Since the decision tree rooted at $m_1$ is a $\le i$ tree and the $\le i$ path ends in a backup, the gain
of Opt1 is $\alpha G' + (1 - \alpha)\left(G_i \sum_{h \le i} p_{hj} + \sum_{h > i} p_{hj} G_h\right)$. Thus, since $G_f \le G_i$ for $f < i$, the expected
gain of Opt1 is at least as high as that of Opt2. Thus, Opt1 is also optimum. Note that Opt1 is similar
to Opt except possibly for the decision tree rooted at $m$, which is a $\le i$ tree and satisfies conditions (p1)
and (p2). The result follows. □

We now state and prove a theorem which implies the Structure Theorem. 

Theorem 5.6 (Implies the Structure Theorem). There exists an optimum policy that uses a unique backup
channel. If such an optimum policy probes at least one channel, it uses the backup channel at the end of a
$\le i$ path originating from the root of its decision tree.

Proof. Consider an optimum policy that does not use a backup channel at all. Consider the path in its
decision tree which corresponds to all probed channels being in state 0. Modify the policy so as to use the
last channel probed in this path as a backup. Note that the gain does not decrease. Thus, the modified
policy is also optimum. Thus, there always exists an optimum policy that uses a backup channel at the
end of some path in its decision tree. If one such optimum policy Opt3 does not probe any channel, then
the theorem follows. Let Opt3 probe a channel. Clearly, then, Opt3 probes a channel $j$ at the root node
of its decision tree, say $m$. Let $i$ be the highest state of $j$ for which Opt3 transmits in a backup channel
somewhere downstream of $m$. Then, by Lemma 5.5, there exists another optimum policy Opt4 for which
the decision tree rooted at $m$, and hence the overall decision tree, (a) is a $\le i$ tree, (b) uses a channel, say $\ell$,
as a backup at the end of the $\le i$ path in the tree and (c) uses $\ell$ as a backup whenever it uses a backup. The
theorem follows. □

5.3 The Policy ReserveBkup($\ell$): Algorithm and Intuition

We require the subsequent definitions to specify the policy ReserveBkup($\ell$).

Definition 5.3. For $i = 1, \ldots, n$, let $\hat{r}_i[u] = \frac{\sum_{k=u}^{K-1} p_{ki} r_k}{\sum_{k=u}^{K-1} p_{ki}}$ and $p_i[u] = \sum_{v=u}^{K-1} p_{vi}$. Let $\hat{r}_0[0] = -1$ and
$r_{-1} = -1$. Let $w_\ell = \min\{u : r_u > \hat{r}_\ell[0]\}$.¹

Note that $\hat{r}_\ell[0]$ is the probability of success if the sender uses $\ell$ as a backup.

Definition 5.4. Let $H_{u,\ell} = \emptyset$ for all $u \ge K$. For each $\ell$, starting from $u = K - 1$ down to $u = w_\ell$,
recursively define $H_{u,\ell} = \{ j : j \notin \cup_{v > u} H_{v,\ell} \text{ and } \hat{r}_j[u] - \frac{c_j}{p_j[u]} > \max(\hat{r}_\ell[0], r_{u-1}) \} \setminus \{\ell\}$. Assume that
$c_j / p_j[u] = \infty$ when $p_j[u] = 0$.

¹ Note that $w_\ell$ is well-defined for each $\ell$ as (a) $\hat{r}_\ell[0] \ge r_0 = 0$ and (b) $\hat{r}_\ell[0] < r_{K-1}$, which follows since $p_{K-1,\ell} < 1$ for each $\ell$.

ReserveBkup($\ell$)
(Probing Process:)

Initialize $u = K - 1$.
While $u \ge w_\ell$ and the highest state of a probed channel is lower than $u$,

    probe channels in $H_{u,\ell}$ in non-increasing order of $\hat{r}_j[u] - c_j / p_j[u]$;

    $u \to u - 1$.

(Selection Process:)

Consider the channel $j$ in the highest state $y$ among all probed channels.
(If no channel is probed, $y = -1$.)
If $r_y > \hat{r}_\ell[0]$, then transmit in $j$, else transmit in $\ell$.


Figure 3: The illustration of ReserveBkup($\ell$) for the example of Figure 1: (i) ReserveBkup($k$), (ii) ReserveBkup(0).
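The definitions and the policy above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the 0-based channel indices, and the use of `ell=None` for the case $\ell = 0$ (no backup) are hypothetical conventions, channel states are assumed independent, and `expected_gain` enumerates all joint states, so it is only meant for tiny examples.

```python
from itertools import product

def tail_prob(p_j, u):
    """p_j[u]: probability that the channel is in state u or higher."""
    return sum(p_j[u:])

def tail_reward(p_j, r, u):
    """r-hat_j[u]: expected reward given that the channel is in state >= u."""
    return sum(pk * rk for pk, rk in zip(p_j[u:], r[u:])) / tail_prob(p_j, u)

def probing_plan(p, c, r, ell=None):
    """Return (w, rhat_backup, plan); plan lists (u, ordered H_u) for
    u = K-1 down to w. p[j][k] is the probability of channel j being in state k."""
    K = len(r)
    rhat_backup = -1.0 if ell is None else tail_reward(p[ell], r, 0)
    # w = min{u : r_u > rhat_backup}; default K covers the degenerate case.
    w = min((u for u in range(K) if r[u] > rhat_backup), default=K)
    assigned, plan = set(), []
    for u in range(K - 1, w - 1, -1):
        r_prev = r[u - 1] if u >= 1 else -1.0  # convention r_{-1} = -1
        threshold = max(rhat_backup, r_prev)
        scored = []
        for j in range(len(p)):
            if j == ell or j in assigned or tail_prob(p[j], u) == 0:
                continue  # c_j / p_j[u] is treated as infinite
            index = tail_reward(p[j], r, u) - c[j] / tail_prob(p[j], u)
            if index > threshold:
                scored.append((index, j))
        scored.sort(key=lambda t: -t[0])  # non-increasing order of the index
        H_u = [j for _, j in scored]
        assigned.update(H_u)
        plan.append((u, H_u))
    return w, rhat_backup, plan

def run_slot(p, c, r, state, ell=None):
    """Reward minus probing cost for one realization state[j] of all channels."""
    _, rhat_backup, plan = probing_plan(p, c, r, ell)
    best_state, spent = -1, 0.0
    for u, channels in plan:
        for j in channels:
            if best_state >= u:  # a probed channel is already in state >= u
                break
            spent += c[j]
            best_state = max(best_state, state[j])
    if best_state >= 0 and r[best_state] > rhat_backup:
        return r[best_state] - spent  # transmit in the best probed channel
    backup_reward = r[state[ell]] if ell is not None else 0.0
    return backup_reward - spent

def expected_gain(p, c, r, ell=None):
    """Exact expected gain by enumerating all joint channel states."""
    total = 0.0
    for state in product(range(len(r)), repeat=len(p)):
        prob = 1.0
        for j, s in enumerate(state):
            prob *= p[j][s]
        total += prob * run_slot(p, c, r, state, ell)
    return total
```

As a usage example with two identical two-state channels (`p = [[0.5, 0.5], [0.5, 0.5]]`, `c = [0.1, 0.1]`, `r = [0.0, 1.0]`), reserving channel 1 as backup gives a plan that probes only channel 0 and an expected gain of 0.65, whereas the no-backup variant (`ell=None`) probes both channels and attains 0.6.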

Refer to Figure 3 for examples elucidating ReserveBkup($\ell$). We now explain why ReserveBkup($\ell$)
attains the same gain as Opt($\ell$) for $\ell = 0, \ldots, n$.

First, observe that ReserveBkup($\ell$) $\in \mathcal{P}(\ell)$. This is because, by definition, $\ell \notin H_{u,\ell}$ for any $u$. Thus,
ReserveBkup($\ell$) never probes $\ell$ (refer to the probing process). Also, note that ReserveBkup($\ell$) does
not use any channel other than $\ell$ as backup (refer to the selection process).

Next, since $\hat{r}_\ell[0]$ is the probability of success when the sender transmits in the backup $\ell$, the channel
selection for ReserveBkup($\ell$) is clearly optimal among policies that have the same probing sequence.

We now explain the intuition behind the design of the probing process for ReserveBkup($\ell$). First
let $\ell = 0$. Once a sender observes that a probed channel is in state $u$, it cannot increase its gain any
further by discovering another probed channel in state $u$ or in a lower state. Thus, subsequently, it probes
only the channels $j$ for which the additional reward $\hat{r}_j[u+1]\, p_j[u+1] - r_u\, p_j[u+1]$ exceeds the cost
$c_j$, i.e., it probes the channels in $H_{v,\ell}$, $v > u$. The incremental gain for probing a channel $j$ then is
$\hat{r}_j[u+1]\, p_j[u+1] - r_u\, p_j[u+1] - c_j$. The probing sequence in each $H_{u,\ell}$ follows a non-increasing order of
the ratio between this incremental gain and the probability that the channel is in a state that is higher than
the highest observed state $u$ (i.e., $p_j[u+1]$). We now comment on the major differences between the probing
processes of ReserveBkup($\ell$) for $\ell > 0$ and $\ell = 0$. Note that $\hat{r}_\ell[0]$ is the gain if the sender transmits in $\ell$
without probing any channel. If the sender observes a channel to be in state $u$ for which $r_u \le \hat{r}_\ell[0]$, the
observation does not increase the gain as compared to $\hat{r}_\ell[0]$. Hence, the sender considers the incremental
gain as $\hat{r}_j[u+1]\, p_j[u+1] - \max(r_u, \hat{r}_\ell[0])\, p_j[u+1]$ instead of $\hat{r}_j[u+1]\, p_j[u+1] - r_u\, p_j[u+1]$, and, as
before, probes only channels for which the incremental gain exceeds the probing cost.

Finally, we determine the computation times of ReserveBkup($\ell$) and BestReserveBkup. All
the $H_{u,\ell}$s can be evaluated in $O(nK)$ time, and the sorting required to determine the probing sequence
within each $H_{u,\ell}$ needs $O(n \log n)$ time. Thus, the entire probing sequence, and hence ReserveBkup($\ell$) for any given $\ell$,
can be computed in $O(nK + nK \log n)$ time, or rather in $O(nK \log n)$ time. Now, note
that the computations for $H_{u,0}$ and the sorting order in each $H_{u,0}$ can be reused to determine $H_{u,\ell}$ for all $\ell$s
in additional $O(n)$ time. Thus, all ReserveBkup($\ell$)s can be computed in $O(nK \log n)$ time. The gain for
each ReserveBkup($\ell$) can be computed in $O(nK)$ time. Thus, BestReserveBkup can be computed in
$O(n^2 K)$ time.

5.4 Proof for Theorem 5.2

We now show that the policy ReserveBkup($\ell$) described in Section 5.3 is optimal in the class of policies
$\mathcal{P}(\ell)$. This completes the proof of Theorem 5.2, since we have already shown in Section 5.3 that, for any
given $\ell$, ReserveBkup($\ell$) can be computed in $O(nK \log n)$ time.

First observe that the optimal policy in the class $\mathcal{P}(\ell)$ need not be unique. We consider Opt($\ell$) to
be one optimal policy in $\mathcal{P}(\ell)$ that satisfies the following property. Suppose channel $j$ is the last channel
probed in a path in the decision tree of Opt($\ell$); let $m$ be the node at which $j$ is probed. Then, Opt($\ell$) would
attain a lesser gain if it were not to probe $j$ at $m$. Clearly, such optimal policies exist, and can be obtained
by progressively removing the lowest node at which a channel is probed and which can be removed without
reducing the gain.

We first observe the following about Opt($\ell$). Let the highest state of a probed channel be $u$ when Opt($\ell$)
terminates its probing process. Then Opt($\ell$) transmits in the probed channel if $r_u > \hat{r}_\ell[0]$ and transmits in
$\ell$ otherwise. Thus, the channel selection for ReserveBkup($\ell$) is optimal among all policies that have the
same probing sequence. We now state and prove three lemmas which will establish that the probing process
for ReserveBkup($\ell$) is optimal in $\mathcal{P}(\ell)$, and thereby prove Theorem 5.2.



Lemma 5.7. 1. If all channels in $\cup_{v > \max(u, w_\ell - 1)} H_{v,\ell}$ have already been probed, and the best state
seen so far is $u$, then Opt($\ell$) does not probe any further.

2. Opt($\ell$) cannot terminate the probing process when there is an un-probed channel in $\cup_{v > \max(u, w_\ell - 1)} H_{v,\ell}$
and the best state seen so far is $u$.

Proof. The first part of the lemma clearly holds for $u = K - 1$ since the optimal policy does not probe any
further after observing a channel in state $K - 1$. Let the first part not hold for some $u < K - 1$. Thus,
although all channels in $\cup_{v > \max(u, w_\ell - 1)} H_{v,\ell}$ have already been probed, Opt($\ell$) probes further channels.
Let $j$ be the last channel probed by Opt($\ell$) in one such path. Let $j$ be probed at node $m$ of the decision tree.
Let the highest state of a channel probed before $j$ is probed at $m$ be $u$. Note that after probing $j$, Opt($\ell$) transmits
in (a) backup $\ell$ if the maximum of $u$ and $j$'s state is $w_\ell - 1$ or lower and (b) the channel that has the highest
state among all the probed channels otherwise. Now, consider another policy $C \in \mathcal{P}(\ell)$ which is similar to
Opt($\ell$) except that it does not probe $j$ at node $m$, and instead transmits in (a) backup $\ell$ if $u < w_\ell$ and (b)
the probed channel that is in state $u$ otherwise. Let Opt($\ell$), and hence $C$, reach node $m$ with probability $\alpha$.
Clearly, $\alpha > 0$, else node $m$ can be removed from the decision tree of Opt($\ell$) without reducing its gain. Let
$\Delta$ be the difference between the gains of Opt($\ell$) and $C$. We will arrive at a contradiction by showing that
$\Delta \le 0$. Hence, the first part of the lemma holds.

$$\Delta/\alpha = \sum_{k=0}^{K-1} p_{kj} \max(r_k, r_u, \hat{r}_\ell[0]) - c_j - \max(r_u, \hat{r}_\ell[0])$$

$$= \sum_{k=\max(u, w_\ell - 1)+1}^{K-1} p_{kj} \left(\max(r_k, r_u, \hat{r}_\ell[0]) - \max(\hat{r}_\ell[0], r_u)\right) - c_j$$

$$= \sum_{k=\max(u, w_\ell - 1)+1}^{K-1} p_{kj} \left(r_k - \max(\hat{r}_\ell[0], r_u)\right) - c_j.$$

First, let $u \ge w_\ell - 1$. Thus, $j \notin \cup_{v > u} H_{v,\ell}$. Thus, $\Delta \le 0$. Now, let $u < w_\ell - 1$. Then, $r_u \le r_{w_\ell - 1} \le \hat{r}_\ell[0]$.
Thus, $\Delta/\alpha = \sum_{k=w_\ell}^{K-1} p_{kj} \left(r_k - \max(\hat{r}_\ell[0], r_{w_\ell - 1})\right) - c_j$. Also, $j \notin \cup_{v \ge w_\ell} H_{v,\ell}$. Thus, $\Delta \le 0$.

The second part of the lemma holds by vacuity when $u = K - 1$ since $H_{K,\ell} = \emptyset$. Let $u < K - 1$. Let
the probability that Opt($\ell$) terminates the probing process even when there is an un-probed channel in $H_{v,\ell}$
for some $v > \max(u, w_\ell - 1)$, and the best state seen so far is $u$, be $\alpha_1 > 0$. Then, whenever the above event
happens, Opt($\ell$) transmits in (a) $\ell$ if $u < w_\ell$ and (b) a probed channel that is in state $u$ otherwise. Consider
another policy $D \in \mathcal{P}(\ell)$ which is similar to Opt($\ell$) except that whenever the above event happens it probes
an additional channel $j \in H_{v,\ell}$, and transmits in (a) $\ell$ if the maximum of $u$ and $j$'s state is $w_\ell - 1$ or lower
and (b) the probed channel that has the highest state otherwise. Let $\Delta_1$ be the difference between the gains
of Opt($\ell$) and $D$. We will show that $\Delta_1 < 0$, which is a contradiction. Hence, the second part of the lemma
holds.

$$\Delta_1/\alpha_1 = \max(r_u, \hat{r}_\ell[0]) - \sum_{k=0}^{K-1} p_{kj} \max(r_k, r_u, \hat{r}_\ell[0]) + c_j$$

$$= c_j - \sum_{k=\max(u, w_\ell - 1)+1}^{K-1} p_{kj} \left(r_k - \max(\hat{r}_\ell[0], r_u)\right)$$

$$= c_j - \sum_{k=\max(u, w_\ell - 1)+1}^{v-1} p_{kj} \left(r_k - \max(\hat{r}_\ell[0], r_u)\right) - \sum_{k=v}^{K-1} p_{kj} \left(r_k - \max(\hat{r}_\ell[0], r_u)\right)$$

$$\le c_j - \sum_{k=v}^{K-1} p_{kj} \left(r_k - \max(\hat{r}_\ell[0], r_{v-1})\right).$$

The last inequality follows since $r_k \ge \max(\hat{r}_\ell[0], r_u)$ for $k > \max(u, w_\ell - 1)$ and $r_{v-1} \ge r_u$ since
$v \ge \max(u, w_\ell - 1) + 1$. The result follows since $j \in H_{v,\ell}$. □

Lemma 5.8. Let $w_\ell \le u < K$. Opt($\ell$) probes all channels in $H_{u,\ell}$ before probing any channel that is not
in $\cup_{v \ge u} H_{v,\ell}$. Also, Opt($\ell$) probes all channels of $H_{u,\ell}$ unless one of the probed channels is in state $u$ or a
higher state, and probes these channels in non-increasing order of $\hat{r}_j[u] - \frac{c_j}{p_j[u]}$.

Proof. Suppose the lemma does not hold. Then there exists a node in the decision tree of Opt($\ell$), which
Opt($\ell$) visits with positive probability², such that the decisions at it violate the lemma. Let $m$ be such a
node which is also the farthest from the root node among those that violate this lemma. Then there exists
a state $q \ge w_\ell$ such that there exists a channel in $H_{q,\ell}$ that has not been probed by Opt($\ell$) before it visits
$m$ and the best state seen so far is $q - 1$ or worse. Let $u$ be the highest state that satisfies both the above
criteria and let $j$ be the channel with the largest $\hat{r}_j[u] - \frac{c_j}{p_j[u]}$ value among the un-probed channels of $H_{u,\ell}$.

At node $m$, Opt($\ell$) does not probe $j$, contradicting the lemma. From the second part of Lemma 5.7, the
probing process of Opt($\ell$) cannot terminate at $m$. Thus, Opt($\ell$) probes some channel $i \ne j$ at node $m$.
Note that $i \notin \cup_{v > u} H_{v,\ell}$.

Since Opt($\ell$) has already probed all channels in $\cup_{v > u} H_{v,\ell}$, if channel $i$ is in state $u$ or a higher state,
Opt($\ell$) does not probe any further (first part of Lemma 5.7), and transmits in $i$ (since $i$ has the highest state,
say $s$, among the probed channels and $r_s \ge r_u \ge r_{w_\ell} > \hat{r}_\ell[0]$). If $i$ is in state $u - 1$ or a lower state,
since $H_{u,\ell}$ has un-probed channels, the probing process cannot terminate (second part of Lemma 5.7). Now,
Opt($\ell$) probes $j$ next (otherwise $m$ will not be the node farthest from the root to violate the lemma). Using
similar arguments, it follows that after probing $j$, Opt($\ell$) transmits in $j$ if $j$ is in state $u$ or a higher state.

² If the lemma is violated at a node which Opt($\ell$) visits with probability 0, we can, without reducing the gain of Opt($\ell$), alter
the decisions at the node so as to satisfy the lemma. Hence, without any loss of generality, we assume that the decisions of Opt($\ell$)
satisfy the lemma at all such nodes.

The situation resembles the decision tree in Figure 4(a). The trees $T_1, \ldots, T_{u^2}$ correspond to observing
the ordered pair $(i = u', j = u'')$ where $0 \le u', u'' \le u - 1$. The square boxes denote that Opt($\ell$) does not
probe anything else.

Figure 4: The decision trees of Opt for $u = 3$ (panels (a) and (b)).

Let Opt($\ell$) not traverse node $m$ with probability $\alpha_1$, traverse node $m$ and stop after probing $i$ or $j$
with probability $\alpha_2$, and traverse node $m$ and continue after probing $j$ with probability $\alpha_3$. By assumption,
$\alpha_2 > 0$. Let the conditional expected gains given these scenarios be $G_1, G_2, G_3$ respectively. Then, the
expected gain of Opt($\ell$) is $G_{\text{Opt}(\ell)} = \sum_{i=1}^{3} \alpha_i G_i$. Let the total probing cost en route to node $m$ be $C_1$.
Then, $G_2 = \frac{1 - \alpha_1}{\alpha_2} \left( p_i[u]\hat{r}_i[u] - c_i + (1 - p_i[u])\left[p_j[u]\hat{r}_j[u] - c_j\right] - C_1 \right)$.

Now consider an alternate policy $A$ that is similar to Opt($\ell$) except for the tree rooted at node $m$.
Figure 4(b) shows the tree rooted at node $m$ in policy $A$ for $u = 3$. Here, $A$ probes $j$ at node $m$ and
subsequently probes $i$ unless $j$ is in state $u$ or a higher state. The tree $T'$ corresponding to the ordered pair
$(i = u', j = u'')$ in Opt is assigned to the branch corresponding to the ordered pair $(j = u'', i = u')$ in
$A$. The contributions to the gain from the trees $T_1, \ldots, T_{u^2}$ remain the same because in both scenarios
the probabilities of reaching these trees are the same. Thus, the expected gain of this new policy is $G_c =
\alpha_1 G_1 + \alpha_2 G'_2 + \alpha_3 G_3$, where $G'_2 = \frac{1 - \alpha_1}{\alpha_2} \left( p_j[u]\hat{r}_j[u] - c_j + (1 - p_j[u])\left[p_i[u]\hat{r}_i[u] - c_i\right] - C_1 \right)$.



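Now if $i \in H_{u,\ell}$, then we have $\hat{r}_j[u] - \frac{c_j}{p_j[u]} > \hat{r}_i[u] - \frac{c_i}{p_i[u]}$, which is the condition that arises from violating
the non-increasing order. If $i \notin H_{u,\ell}$, then since $i \notin \cup_{v > u} H_{v,\ell}$ and $u \ge w_\ell$, $\hat{r}_i[u] - \frac{c_i}{p_i[u]} \le \max(r_{u-1}, \hat{r}_\ell[0])$.
But $\hat{r}_j[u] - \frac{c_j}{p_j[u]} > \max(r_{u-1}, \hat{r}_\ell[0])$ since $j \in H_{u,\ell}$ and $u \ge w_\ell$. Therefore, in both cases we have
$\hat{r}_j[u] - \frac{c_j}{p_j[u]} > \hat{r}_i[u] - \frac{c_i}{p_i[u]}$. But this implies that

$$\frac{G_c - G_{\text{Opt}(\ell)}}{1 - \alpha_1} = p_j[u]\hat{r}_j[u] - c_j + (1 - p_j[u])\{p_i[u]\hat{r}_i[u] - c_i\} - p_i[u]\hat{r}_i[u] + c_i - (1 - p_i[u])\{p_j[u]\hat{r}_j[u] - c_j\}$$

$$= p_j[u]\, p_i[u] \left( \hat{r}_j[u] - \frac{c_j}{p_j[u]} - \hat{r}_i[u] + \frac{c_i}{p_i[u]} \right) > 0.$$

Thus, since $\alpha_1 < 1$, $G_c > G_{\text{Opt}(\ell)}$. Thus, we arrive at a contradiction. The result follows. □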
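The algebra behind the last equality in the exchange argument can be spelled out as follows. This is a worked expansion only; the abbreviations $A_i$ and $A_j$ are introduced here for convenience and do not appear in the original argument.

```latex
\begin{align*}
&\text{Let } A_i = p_i[u]\,\hat r_i[u] - c_i, \qquad A_j = p_j[u]\,\hat r_j[u] - c_j. \\
&A_j + (1 - p_j[u])\,A_i - A_i - (1 - p_i[u])\,A_j \\
&\quad = p_i[u]\,A_j - p_j[u]\,A_i \\
&\quad = p_i[u]\,p_j[u]\,\hat r_j[u] - p_i[u]\,c_j - p_j[u]\,p_i[u]\,\hat r_i[u] + p_j[u]\,c_i \\
&\quad = p_j[u]\,p_i[u]\left(\hat r_j[u] - \frac{c_j}{p_j[u]} - \hat r_i[u] + \frac{c_i}{p_i[u]}\right).
\end{align*}
```

The sign of the difference is therefore determined entirely by the comparison of the two indices $\hat r[u] - c/p[u]$, which is why probing in non-increasing order of this index cannot be improved by swapping two adjacent probes.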



Lemma 5.9. Opt($\ell$) probes only channels in $\cup_{v=w_\ell}^{K-1} H_{v,\ell}$.

Proof. From the first part of Lemma 5.7, Opt($\ell$) terminates its probing process after probing all channels in
$\cup_{v=w_\ell}^{K-1} H_{v,\ell}$. From Lemma 5.8, Opt($\ell$) must probe all channels in $\cup_{v=w_\ell}^{K-1} H_{v,\ell}$ before probing a channel that
is not in $\cup_{v=w_\ell}^{K-1} H_{v,\ell}$. The result follows. □

The optimality of the probing process for ReserveBkup($\ell$) in $\mathcal{P}(\ell)$ follows from Lemmas 5.8 and 5.9.
Thus, Theorem 5.2 follows.



6 Additive Approximation Schemes for Equal Probing Costs 

We now consider the case that all channels have equal probing costs, i.e., $c_j = c > 0$, but still allow for
different distributions for the state processes of different channels. This assumption is motivated by the fact
that oftentimes the probing cost is determined by the energy consumed in transmitting the probe packets,
which is usually similar for different channels. We still assume that the sender is saturated. We present a
policy that, given any $\epsilon > 0$, attains a gain of at least $G_{\text{Opt}} - \epsilon\, r_{K-1}$. The time to compute this policy depends
exponentially on $1/\epsilon$, but is polynomial in $n$ for any fixed $\epsilon > 0$.



Motivated by the Structure Theorem (Theorem 5.6), we consider the following class of policies.



Definition 6.1. Let $\mathcal{P}(\ell, i)$ be the class of policies which (a) take the same actions if a probed channel is
observed in state $i'$ or $i''$ at any node when $i', i'' \le i$ and (b) take backup $\ell$ at the end of the $\le i$ path
originating from the roots of their decision trees and do not take backups anywhere else. Let Opt($\ell$, $i$)
denote the optimum policy in this class with gain $G(\ell, i)$.



We know from Theorem 5.6 that Opt is in $\mathcal{P}(\ell^*, i^*)$ for some $\ell^* \in \{1, \ldots, n\}$ and some $i^* \in
\{0, \ldots, K - 1\}$. Therefore it suffices to provide approximations of the Opt($\ell$, $i$) policies for different
$\ell, i$. Note that the policy that approximates Opt($\ell$, $i$) need not be in $\mathcal{P}(\ell, i)$.

We now prove the central lemma in this section, which also presents the approximation algorithm. 

Lemma 6.1. We can compute in $O(n^{h+1} h K)$ time a policy whose gain is at least $G(\ell, i) - \epsilon r_{K-1}$, where $h = 1 + \lceil -\log_{1/(1-\epsilon)} \epsilon \rceil$.

Proof. First, let $c \le \epsilon r_{K-1}$. Now, observe that OptNoBkup attains a gain of at least $G(\ell, i) - \epsilon r_{K-1}$ since it attains a gain of at least $G(\ell, i) - c$. To see the latter, note that if Opt($\ell$, $i$) is modified to first probe $\ell$ and subsequently transmit in $\ell$ wherever it was using $\ell$ as a backup, then its gain decreases by at most $c$. Thus, the modified policy has a gain of at least $G(\ell, i) - c$. Now, since the modified policy does not use any backup channel, its gain is at most that of OptNoBkup. The result follows.

Now, let $c > \epsilon r_{K-1}$. Note that Opt($\ell$, $i$) does not transmit in a probed channel whose state is $i$ or less. Now, using a proof which is the same as that for Claim 5.4, it follows that if Opt($\ell$, $i$) probes a channel $u \ne \ell$, and if $u$ is not the first channel probed, then $\sum_{j=i+1}^{K-1} p_{ju} r_j \ge c > \epsilon r_{K-1}$, and hence $\sum_{j=i+1}^{K-1} p_{ju} > \epsilon$. Now, consider the $\le i$ path of Opt($\ell$, $i$) starting from the root, which ends in the backup. The probability of continuing on this path decreases by a factor of $1 - \epsilon$ for each additional node after the first node (the escape probability is at least $\epsilon$ after the first node). Therefore, the probability $q$ of continuing beyond $h$ nodes in this path is less than $\epsilon$, since $h = 1 + \lceil -\log_{1/(1-\epsilon)} \epsilon \rceil$. Thus, if Opt($\ell$, $i$) is modified to take the backup after $h$ nodes in this path, the gain of the modified policy is at least $G(\ell, i) - \epsilon r_{K-1}$. Let $\mathcal{P}(\ell, i, h)$ be the class of policies that are in $\mathcal{P}(\ell, i)$ and have $h$ or fewer nodes in their $\le i$ path originating from their roots. Thus, the policy that has the maximum gain among all policies in $\mathcal{P}(\ell, i, h)$ has gain at least $G(\ell, i) - \epsilon r_{K-1}$.
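The truncation depth $h$ can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function name `truncation_depth` is our own. It computes $h$ and confirms that $h - 1$ further nodes, each continuing with probability at most $1 - \epsilon$, leave a continuation probability of at most $\epsilon$.

```python
import math

def truncation_depth(eps: float) -> int:
    """h = 1 + ceil(-log_{1/(1-eps)} eps), defined for 0 < eps < 1."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    # -log_{1/(1-eps)}(eps) equals ln(1/eps) / ln(1/(1-eps)).
    return 1 + math.ceil(math.log(1.0 / eps) / math.log(1.0 / (1.0 - eps)))

for eps in (0.5, 0.25, 0.1):
    h = truncation_depth(eps)
    # Each node after the first keeps the policy on the <= i path with
    # probability at most 1 - eps, so h - 1 such nodes leave at most:
    bound = (1.0 - eps) ** (h - 1)
    print(eps, h, bound <= eps)
```

Since $h$ grows roughly like $(1/\epsilon)\ln(1/\epsilon)$ for small $\epsilon$, the $n^{h}$ factor in the running time is where the exponential dependence on $1/\epsilon$ comes from.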

Thus, the result follows if we can compute the policy that has the maximum gain in $\mathcal{P}(\ell, i, h)$ in $O(n^{h+1} h K)$ time. Note that a channel can be probed at most once in the $\le i$ path originating from the root for this policy, and thus the number of possible probing sequences for this path is bounded by $O(n^h)$. For each probing sequence in this path, we compute the rest of the actions as follows so as to maximize the gain. Suppose that we are at a node $t$ in this path where we just probed channel $x$ and the set of probed channels (including $x$) is $Q$. Since the probing sequence in the $\le i$ path is given, we only need to determine the actions downstream if $x$ is in state $j > i$. In this case, we know from Definition 6.1 that a backup is not used downstream. Also, only the channels in $\bar{Q}$ (those not yet probed) can be probed downstream. Further, all channels which are in states $\le j$ will not be used for transmission, and probing them does not increase the gain. Therefore we can pretend that we have a new system over $\bar{Q}$ where $r''_s = r_s - r_j$ if $s > j$ and $0$ otherwise. We can use OptNoBkup on the channels of $\bar{Q}$ with rewards $\{r''_s\}$ to find an optimal subtree.

Thus, given the probing sequence in the $\le i$ path, the rest of the tree can be computed using $h$ applications of OptNoBkup, which requires $O(h n K \log n)$ time. The gain of each tree can be computed in $O(nKh)$ time. Thus, $O(h n K \log n)$ time is needed in this part. Since there are $O(n^h)$ probing sequences for the $\le i$ path, the policy that attains the maximum gain in $\mathcal{P}(\ell, i, h)$ can be computed in $O(n^{h+1} h K)$ time. The result follows. □
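The two ingredients of this proof, enumerating the probing sequences of the $\le i$ path and shifting the rewards once a probed channel is seen in state $j$, can be sketched as follows. This is an illustrative sketch only: the function names are ours, and we assume channels are indexed $0, \ldots, n-1$ and that `rewards[s]` holds $r_s$.

```python
from itertools import permutations

def probe_sequences(n, h):
    """Yield every ordered sequence of 1..h distinct channels out of n.

    There are O(n^h) of them, which is the count used in the proof.
    """
    for length in range(1, h + 1):
        yield from permutations(range(n), length)

def residual_rewards(rewards, j):
    """Rewards of the reduced system after a probed channel is seen in state j.

    States s > j are worth only their improvement r_s - r_j over the channel
    already in hand; states s <= j are worth nothing.
    """
    return [r - rewards[j] if s > j else 0.0 for s, r in enumerate(rewards)]

print(sum(1 for _ in probe_sequences(3, 2)))
print(residual_rewards([0.0, 0.25, 0.5, 1.0], 1))
```

For each enumerated sequence, a no-backup solver (OptNoBkup in the text, not implemented here) would then be run on the unprobed channels with the residual rewards.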



From Lemma 6.1 and since $G_{\mathrm{OPT}} = G(\ell, i)$ for some $\ell \in \{1, \ldots, n\}$ and $i \in \{0, \ldots, K-1\}$, the policy that has the maximum gain among those that attain gains of $G(\ell, i) - \epsilon r_{K-1}$ for different $\ell \in \{1, \ldots, n\}$ and $i \in \{0, \ldots, K-1\}$ attains a gain of at least $G_{\mathrm{OPT}} - \epsilon r_{K-1}$. We can compute such a policy in $O(n^{h+2} h K^2)$ time. We therefore obtain the following theorem.

Theorem 6.2. In $O(n^{h+2} h K^2)$ time we can compute a policy whose gain is at least $G_{\mathrm{OPT}} - \epsilon r_{K-1}$, where $h = 1 + \lceil -\log_{1/(1-\epsilon)} \epsilon \rceil$.

Finally, note that since we are focusing on only an additive approximation, the computation time can be made independent of $K$. First, divide $[0, r_{K-1}]$ into disjoint intervals of size $\epsilon/2$. Then, consider a new system where the probability of success in each state $i$ equals $(\epsilon/2) \lfloor 2 r_i/\epsilon \rfloor$. This new system effectively consists of at most $2 r_{K-1}/\epsilon \le 2/\epsilon$ states. In this system, the approximate policy we just developed approximates the optimum gain within an additive factor of $\epsilon/2$. The gain of the optimum policy in this system is at least $G_{\mathrm{OPT}} - \epsilon/2$. Thus, the approximate policy computed in this system attains a gain of at least $G_{\mathrm{OPT}} - \epsilon$ in the actual system. Note that irrespective of the number of states in the original system, the time required for computing the approximate policy in the new system is $O(n^{h+2} h/\epsilon^2)$ where $h = 1 + \lceil -\log_{1/(1-\epsilon/2)} (\epsilon/2) \rceil$.
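The rounding step can be illustrated with a short sketch. The helper name `quantize_rewards` and the sample values are assumptions made for illustration; the rounding rule itself is the one stated above.

```python
import math

def quantize_rewards(rewards, eps):
    """Round each success probability down to a multiple of eps/2.

    Implements r_i -> (eps/2) * floor(2 * r_i / eps), so every reward loses
    less than eps/2 and the number of distinct levels no longer depends on K.
    """
    step = eps / 2.0
    return [step * math.floor(r / step) for r in rewards]

rewards = [0.1, 0.33, 0.58, 0.97]
quantized = quantize_rewards(rewards, eps=0.5)
print(quantized)
print(len(set(quantized)))
```

States that round to the same level can be merged, which is why the resulting system has a number of states bounded in terms of $1/\epsilon$ alone.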

7 The Unsaturated Sender Problem 

We now consider the case that the sender may not always have packets to transmit. We assume that the sender generates packets as per a positive recurrent Markovian arrival process such that the average number of packets arriving in its queue under the steady state distribution of the arrival process is $\lambda$. Thus, the sender may choose not to transmit in a given slot even when her queue is non-empty, e.g., when the transmission conditions of the probed channels are not acceptable to her. However, the sender needs to transmit at rate $\lambda$, else her queue will become unstable. Thus, we need to jointly optimize the probing, channel selection and transmission decisions so as to maximize the gain subject to stabilizing the sender's queue. Specifically, we seek to solve Problem 2 formulated in Section 3. As stated in Section 3, we consider $n$ channels with $K$ states and potentially unequal probing costs and different distributions for the state processes. We note that no previous results, not even exponential time policies, were known for this problem.

We will assume that the optimal policy OptUnSat is ergodic, and denote its gain by $G_U$. We present a stable policy that (a) attains a gain of $G_U$ for $K = 2$ and at least $(2/3) G_U$ for $K > 2$ and (b) can be computed in $O(n^2 K (n + K))$ time.

7.1 Roadmap and Main Results 

We first introduce the following definitions.

Definition 7.1. Let $\Pi$ denote the set of decision trees. Let $\mathcal{C}_\sigma$ denote the expected probing cost in decision tree $\sigma \in \Pi$, and $\mathcal{M}_\sigma$ be the set of leaf nodes where the decision is to transmit. Let $p_{m\sigma}$ denote the probability that the leaf node $m$ is reached in $\sigma$, and if $m \in \mathcal{M}_\sigma$, $\bar r_{m\sigma}$ is the probability of success for the transmission at $m$. Let $\mathcal{S}_\sigma = \sum_{m \in \mathcal{M}_\sigma} p_{m\sigma}$ and $\mathcal{G}_\sigma = \sum_{m \in \mathcal{M}_\sigma} \bar r_{m\sigma} p_{m\sigma} - \mathcal{C}_\sigma$ for a $\sigma \in \Pi$.

Since the number of channels and the number of transmission states of the channels are both finite, $\Pi$ constitutes a finite set. Note that now a decision tree may not transmit in all leaf nodes. For example, the decision trees in Figure 1(a),(b) are examples of decision trees in $\Pi$, and in addition, if the decision trees in Figure 1(b) are modified so as not to transmit at the end of the left-most paths, the modifications will also constitute decision trees in $\Pi$. Note that $\Pi$ also includes the decision tree that neither probes nor transmits in any channel. Thus, if the sender takes actions as per the decision tree $\sigma$ in a slot in which she has packets to transmit, she transmits with probability $\mathcal{S}_\sigma$ in the slot and attains a gain of $\mathcal{G}_\sigma$ in the slot.
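To make the two per-tree quantities concrete, here is a minimal sketch that evaluates the transmission probability and the gain of a decision tree from its leaves. The `Leaf` record and the toy tree are our own illustrative assumptions; only the two formulas come from Definition 7.1.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    reach_prob: float    # p_{m,sigma}: probability that leaf m is reached
    transmits: bool      # True if m belongs to M_sigma
    success_prob: float  # rbar_{m,sigma}; ignored when transmits is False

def transmit_prob(leaves):
    """S_sigma: total probability of ending at a transmitting leaf."""
    return sum(leaf.reach_prob for leaf in leaves if leaf.transmits)

def gain(leaves, probing_cost):
    """G_sigma: expected success probability minus expected probing cost."""
    reward = sum(leaf.success_prob * leaf.reach_prob
                 for leaf in leaves if leaf.transmits)
    return reward - probing_cost

# Toy tree: probe one two-state channel at cost 0.125; transmit only if it
# is found in the good state (probability 0.5, success probability 1.0).
tree = [Leaf(0.5, True, 1.0), Leaf(0.5, False, 0.0)]
print(transmit_prob(tree), gain(tree, probing_cost=0.125))
```

A tree that transmits at every leaf has transmission probability 1, while the tree that neither probes nor transmits has transmission probability 0 and gain 0.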

Step 1: Expressing the optimal policy as the solution of a Linear Program: Throughout this discussion, we assume that $\epsilon \in (0, \frac{1}{\lambda} - 1)$ is a suitably chosen small constant. Consider the following linear program LPUNSAT($\epsilon$).

LPUNSAT($\epsilon$): Maximize $\sum_{\sigma \in \Pi} \beta_\sigma \mathcal{G}_\sigma$

$\sum_{\sigma \in \Pi} \beta_\sigma \mathcal{S}_\sigma = \lambda(1 + \epsilon)$ (stability constraint)

$\beta_\sigma \ge 0 \quad \forall\, \sigma \in \Pi$

Definition 7.2. Let $\{\beta^*(\epsilon)\}$ be the optimum solution of LPUNSAT($\epsilon$). Let $\mathcal{G}^*(\epsilon)$ denote the optimal value of the objective function.

We will prove that an arbitrarily close approximation of the optimal policy can be obtained using $\{\beta^*(\epsilon)\}$. But $\{\beta^*(\epsilon)\}$ does not provide an exact solution because the stability constraint in LPUNSAT($\epsilon$) involves a positive $\epsilon$, which is required to ensure stability. Thus, the approximation improves as $\epsilon$ decreases. We first prove the following results.

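Proposition 7.1. For any $0 \le \epsilon_1 < \epsilon_2 < \frac{1}{\lambda} - 1$, $\mathcal{G}^*(\epsilon_1) \le \mathcal{G}^*(\epsilon_2)$.

Proof. Let $\{\beta\}$ denote the optimal solution for LPUNSAT($\epsilon_1$). Thus, $\sum_\sigma \beta_\sigma \mathcal{S}_\sigma = \lambda(1 + \epsilon_1) < \lambda(1 + \epsilon_2)$. There exists a policy $x$ such that $\beta_x > 0$ and $\mathcal{S}_x < 1$, else $\sum_\sigma \beta_\sigma \mathcal{S}_\sigma = 1 > \lambda(1 + \epsilon_1)$. Let $x^c$ denote the policy which transmits at all leaf nodes, but is otherwise similar to $x$. Thus, $\mathcal{S}_{x^c} = 1$, and $\mathcal{G}_{x^c} \ge \mathcal{G}_x$. Now increase $\beta_{x^c}$ and decrease $\beta_x$ such that their sum remains the same. This change increases $\sum_\sigma \beta_\sigma \mathcal{S}_\sigma$, and ensures that the objective value does not decrease and that $\sum_\sigma \beta_\sigma$ remains the same. Continue this process until $\sum_\sigma \beta_\sigma \mathcal{S}_\sigma = \lambda(1 + \epsilon_2)$. This yields a feasible solution to LPUNSAT($\epsilon_2$) whose value is at least $\mathcal{G}^*(\epsilon_1)$. □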

The next lemma provides an upper bound for the gain of the optimal policy. 

Lemma 7.1. $G_U \le \mathcal{G}^*(\epsilon) \quad \forall\, \epsilon \in [0, \frac{1}{\lambda} - 1)$.

Proof. Let $\beta_\sigma$ denote the probability with which the optimal policy OptUnSat chooses policy $\sigma$. Then $\{\beta_\sigma\}$ forms a feasible solution to LPUNSAT(0), and the expected gain of this policy is simply $\sum_{\sigma \in \Pi} \beta_\sigma \mathcal{G}_\sigma$. Thus, $G_U \le \mathcal{G}^*(0)$. By Proposition 7.1 we have $\mathcal{G}^*(0) \le \mathcal{G}^*(\epsilon)$, which completes the proof. □

We next show that any feasible solution $\{\beta(\epsilon)\}$ to LPUNSAT($\epsilon$) yields a stable policy, UNSAT($\beta(\epsilon)$), whose gain is close to the corresponding objective value of LPUNSAT($\epsilon$). We describe policy UNSAT($\beta(\epsilon)$) after introducing the following definition. A slot in which the sender's queue is non-empty is referred to as a busy slot.

Policy UNSAT($\beta(\epsilon)$)

In each busy slot, select a $\sigma \in \Pi$ in accordance with the probability distribution $\{\beta(\epsilon)\}$, and probe channels, decide whether to transmit, and select a channel as per $\sigma$.



Lemma 7.2. Assume that $\epsilon \in (0, \frac{1}{\lambda} - 1)$. If $\{\beta(\epsilon)\}$ is a feasible solution of LPUNSAT($\epsilon$) and has objective value $\mathcal{G}(\epsilon)$, then UNSAT($\beta(\epsilon)$) is stable and attains a gain of $\frac{\mathcal{G}(\epsilon)}{1+\epsilon}$.

Proof. In any busy slot, UNSAT($\beta(\epsilon)$) selects a decision tree $\sigma$ in accordance with the probability distribution $\beta(\epsilon)$. Thus, the stability constraint in LPUNSAT($\epsilon$) ensures that in any busy slot the sender transmits packets with probability $\lambda(1 + \epsilon)$, which exceeds $\lambda$. Since the state of the arrival process and the queue length under UNSAT($\beta(\epsilon)$) constitute a Markov chain, stability follows from standard results and analytical techniques (Theorem 2.2.3 in [12]).

Since the system is stable and UNSAT($\beta(\epsilon)$) transmits a packet with probability $\lambda(1 + \epsilon)$ in each busy slot, using Little's law, at least $\frac{1}{1+\epsilon}$ of slots are busy. The gain of UNSAT($\beta(\epsilon)$) in each busy slot is $\mathcal{G}(\epsilon)$. Thus, UNSAT($\beta(\epsilon)$) attains a gain of at least $\mathcal{G}(\epsilon)/(1 + \epsilon)$. □
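The busy-slot fraction used in this proof can be checked with a small simulation. This is a rough sketch under simplifying assumptions that are ours rather than the paper's: arrivals are Bernoulli with rate $\lambda$ (a special case of the Markovian arrival process), at most one packet arrives per slot, and each transmission removes one packet from the queue.

```python
import random

def busy_fraction(lam, eps, slots=200_000, seed=0):
    """Simulate the sender's queue and return the fraction of busy slots."""
    rng = random.Random(seed)
    mu = lam * (1.0 + eps)  # transmission probability in each busy slot
    queue = 0
    busy = 0
    for _ in range(slots):
        if queue > 0:              # busy slot: the queue is non-empty
            busy += 1
            if rng.random() < mu:  # the sampled tree decides to transmit
                queue -= 1
        if rng.random() < lam:     # Bernoulli arrival (our assumption)
            queue += 1
    return busy / slots

lam, eps = 0.5, 0.25
print(busy_fraction(lam, eps), 1.0 / (1.0 + eps))
```

In the long run departures must balance arrivals, so the busy fraction times $\lambda(1+\epsilon)$ is close to $\lambda$, which is the $1/(1+\epsilon)$ factor appearing in the gain.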

In view of Lemmas 7.1 and 7.2, UNSAT($\beta^*(\epsilon)$) is stable and attains a gain of $\frac{\mathcal{G}^*(\epsilon)}{1+\epsilon} \ge G_U/(1 + \epsilon)$. But we do not know how to solve LPUNSAT($\epsilon$) in polynomial time. We therefore seek to obtain constant factor approximations of the optimal solution of LPUNSAT($\epsilon$) in polynomial time. This motivates the following definitions.

Definition 7.3. A $c$-approximation to LPUNSAT($\epsilon$) is a feasible solution $\{\beta\}$ of LPUNSAT($\epsilon$) for which the objective function is at least $c\,\mathcal{G}^*(\epsilon)$. A $c$-approximation to the unsaturated sender problem constitutes a (potentially randomized) stable policy that attains a gain of at least $c\,G_U$.



It follows from Lemmas 7.1 and 7.2 that for any $\epsilon \in (0, \frac{1}{\lambda} - 1)$, a $c$-approximation to LPUNSAT($\epsilon$) easily yields a $c/(1 + \epsilon)$-approximation to the unsaturated sender problem. We therefore focus on obtaining a $c$-approximation to LPUNSAT($\epsilon$) in polynomial time.

Step 2: Considering the Lagrangean Relaxation. We consider a Lagrangean Relaxation LPLAGRANGE($\epsilon$, $\mathcal{L}$) for $\mathcal{L} \ge 0$.

LPLAGRANGE($\epsilon$, $\mathcal{L}$): Maximize $\sum_{\sigma \in \Pi} \beta_\sigma \mathcal{G}_\sigma + \mathcal{L} \left( \lambda(1 + \epsilon) - \sum_{\sigma \in \Pi} \beta_\sigma \mathcal{S}_\sigma \right)$

$\sum_{\sigma \in \Pi} \beta_\sigma = 1$

$\beta_\sigma \ge 0 \quad \forall\, \sigma \in \Pi$

Note that the optimal solution of LPLAGRANGE($\epsilon$, $\mathcal{L}$) uses only (a) the $\sigma$s that always transmit when $\mathcal{L} \le 0$ and (b) the $\sigma$s that never transmit when $\mathcal{L} \ge r_{K-1}$. The hope is that the ideal Lagrange multiplier $\mathcal{L}^*$ would ensure that $\sum_{\sigma \in \Pi} \beta_\sigma \mathcal{S}_\sigma = \lambda(1 + \epsilon)$ for the optimum solution of LPLAGRANGE($\epsilon$, $\mathcal{L}^*$), and we would have a solution for LPUNSAT($\epsilon$). However, the computation time for finding such an $\mathcal{L}^*$ is the same as that for the original problem!

We proceed as follows to circumvent this difficulty. We obtain a $c$-approximate solution for LPUNSAT($\epsilon$) in polynomial time using the following observation and the subsequent lemma.

Proposition 7.2. For any $\mathcal{L} \ge 0$, there exists an optimum solution of LPLAGRANGE($\epsilon$, $\mathcal{L}$) in which $\beta_\sigma = 1$ for some $\sigma = \sigma^{\mathcal{L}}$, and $\beta_\sigma = 0$ if $\sigma \in \Pi \setminus \{\sigma^{\mathcal{L}}\}$.
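The reason a single tree suffices is that the Lagrangean objective is linear in $\beta$ over the probability simplex, so its maximum is attained at a vertex. A minimal sketch over an explicitly listed toy set of trees follows; the function name and the numbers are illustrative assumptions, and the real set $\Pi$ is far too large to enumerate this way.

```python
def best_single_tree(gains, transmit_probs, multiplier):
    """Return (index, score) of the tree maximizing G_sigma - L * S_sigma.

    The constant term L * lambda * (1 + eps) does not affect the argmax,
    so it is omitted from the score.
    """
    scores = [g - multiplier * s for g, s in zip(gains, transmit_probs)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy trees: never transmit, transmit half the time, always transmit.
gains = [0.0, 0.375, 0.5]
transmit_probs = [0.0, 0.5, 1.0]
for multiplier in (0.0, 0.5, 1.0):
    print(multiplier, best_single_tree(gains, transmit_probs, multiplier))
```

In this toy instance the chosen tree moves from always transmitting to never transmitting as the multiplier grows, which mirrors remarks (a) and (b) above.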

The above proposition motivates the following definition. 

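Definition 7.4. For any $\mathcal{L} \ge 0$, a policy $\sigma \in \Pi$ is said to $c$-approximate LPLAGRANGE($\epsilon$, $\mathcal{L}$) if $\mathcal{G}_\sigma - \mathcal{L}\mathcal{S}_\sigma \ge c\,(\mathcal{G}_{\sigma'} - \mathcal{L}\mathcal{S}_{\sigma'})$ for any $\sigma' \in \Pi$.

Lemma 7.3. Assume that $\epsilon \in (0, \frac{1}{\lambda} - 1)$ and $0 < c \le 1$. Suppose we have two decision trees $\sigma_+, \sigma_-$ that $c$-approximate LPLAGRANGE($\epsilon$, $\mathcal{L}^+$) and LPLAGRANGE($\epsilon$, $\mathcal{L}^-$) respectively. Suppose further that $\mathcal{S}_{\sigma_+} < \lambda(1 + \epsilon) \le \mathcal{S}_{\sigma_-}$ and that $0 \le \mathcal{L}^+ - \mathcal{L}^- \le c\,\epsilon\,\mathcal{G}^*(\epsilon)$. Consider $\{\beta\}$ such that $\beta_{\sigma_+} = a$, $\beta_{\sigma_-} = 1 - a$, and $\beta_\sigma = 0$ for $\sigma \in \Pi \setminus \{\sigma_+, \sigma_-\}$, where $a = \frac{\mathcal{S}_{\sigma_-} - \lambda(1+\epsilon)}{\mathcal{S}_{\sigma_-} - \mathcal{S}_{\sigma_+}}$.

Then $\{\beta\}$ constitutes a feasible solution for LPUNSAT($\epsilon$) and $\sum_{\sigma \in \Pi} \beta_\sigma \mathcal{G}_\sigma \ge c(1 - \epsilon)\,\mathcal{G}^*(\epsilon)$.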
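The mixing weight is the value of $a$ for which the two-tree mixture meets the stability constraint with equality, i.e. $a\,\mathcal{S}_{\sigma_+} + (1-a)\,\mathcal{S}_{\sigma_-} = \lambda(1+\epsilon)$. A small sketch, with a function name and sample numbers of our own choosing:

```python
def mixing_weight(s_plus, s_minus, lam, eps):
    """Weight a on sigma_+ so that the mixture transmits at rate lam*(1+eps).

    Requires S_{sigma_+} < lam * (1 + eps) <= S_{sigma_-}, as in Lemma 7.3.
    """
    target = lam * (1.0 + eps)
    if not (s_plus < target <= s_minus):
        raise ValueError("need s_plus < lam * (1 + eps) <= s_minus")
    return (s_minus - target) / (s_minus - s_plus)

a = mixing_weight(s_plus=0.5, s_minus=1.0, lam=0.5, eps=0.25)
# The mixture's transmission probability should equal lam * (1 + eps).
print(a, a * 0.5 + (1.0 - a) * 1.0)
```

The range check mirrors the hypothesis of the lemma: it guarantees that $a$ lies in $[0, 1)$, so the mixture is a valid probability distribution over the two trees.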

We prove the above lemma in Section 7.2. Lemmas 7.1, 7.2 and 7.3 imply the following fact.

Proposition 7.3. If $\{\beta\}$ satisfies the conditions in Lemma 7.3, UNSAT($\beta$) is a $c(1 - \epsilon)/(1 + \epsilon)$-approximation to the unsaturated sender problem.

Step 3: Finding the two solutions. We now address the following important issues: (1) how to obtain $c$-approximations for LPLAGRANGE($\epsilon$, $\mathcal{L}$) for arbitrary $\mathcal{L}$ and (2) how to obtain two multipliers $\mathcal{L}^+, \mathcal{L}^-$ such that the respective $c$-approximations $\sigma_+, \sigma_-$ for LPLAGRANGE($\epsilon$, $\mathcal{L}^+$), LPLAGRANGE($\epsilon$, $\mathcal{L}^-$) satisfy $\mathcal{S}_{\sigma_+} < \lambda(1 + \epsilon) \le \mathcal{S}_{\sigma_-}$. We first observe that the objective function of LPLAGRANGE($\epsilon$, $\mathcal{L}$) can be expressed as

$\sum_{\sigma \in \Pi} \beta_\sigma \mathcal{G}_\sigma + \mathcal{L} \left( \lambda(1 + \epsilon) - \sum_{\sigma \in \Pi} \beta_\sigma \mathcal{S}_\sigma \right) = \mathcal{L}\,\lambda(1 + \epsilon) + \sum_{\sigma \in \Pi} \beta_\sigma \left( \sum_{m \in \mathcal{M}_\sigma} (\bar r_{m\sigma} - \mathcal{L})\, p_{m\sigma} - \mathcal{C}_\sigma \right)$

Thus, optimizing or approximating the above quantity is similar to optimizing or approximating the saturated sender problem in a system where (a) the reward of transmitting in a channel in state $m$ is $r'_m = r_m - \mathcal{L}$ and (b) the sender may choose not to transmit in a slot. The shift in the rewards and the option of not transmitting lead to some important differences from the saturated sender problem we considered earlier (Problem 1). Specifically, the optimal policy in this system will not transmit in a probed channel that is in a state $m$ such that $r_m < \mathcal{L}$, but may transmit in a backup channel in such a state $m$. Thus, the reward of transmitting in a probed channel is non-negative, whereas the reward of transmitting in a backup channel may be negative. Owing to these differences, the proof of the 4/5-approximation no longer holds in this system. Nevertheless, we obtain a 2/3-approximation for this system. We first introduce the following definitions.

Definition 7.5. Consider a system where the sender (a) is saturated (i.e., always has packets to transmit), (b) attains a reward of $r_m - x$ if it transmits in a channel in state $m$, (c) incurs a cost of $c_i$ when it probes channel $i$ and (d) may choose not to transmit in a slot. We refer to this system as the Saturated Altered Reward ($x$) system, and let $T(x)$ be the problem of maximizing the gain in this system. A policy is said to $c$-approximate the $T(x)$ problem if its gain in this system is at least $c$ times the maximum gain in this system.

Note that the policy that solves ($c$-approximates, respectively) the $T(\mathcal{L})$ problem solves ($c$-approximates, respectively) the LPLAGRANGE($\epsilon$, $\mathcal{L}$) problem.

Lemma 7.4. We can solve $T(x)$ optimally ($c = 1$) for $K = 2$ and achieve a $c = 2/3$ approximation for $K > 2$ in $O(n^2 K)$ time.

The optimum policy in a class of "threshold-type" policies ReserveBkup($\ell$, $x$) provides the above optimal and approximate solutions for different values of $K$; as the name suggests, these threshold-type policies are extensions of ReserveBkup($\ell$). We present these threshold-type policies in Section 7.3 and prove Lemma 7.4 using these policies in Section 7.4.

We finally prove in Section 7.5 the last piece, namely:

Lemma 7.5. Assume that $\epsilon \in (0, \frac{1}{\lambda} - 1)$. Using $O(n^2 K (n + K))$ time, we can compute $\sigma_+, \sigma_-, \mathcal{L}^+, \mathcal{L}^-$ which satisfy the properties of Lemma 7.3 for (a) $c = 1$ for $K = 2$ and (b) $c = 2/3$ for $K > 2$.

We present a constructive policy for computing the above $\sigma_+, \sigma_-, \mathcal{L}^+, \mathcal{L}^-$ as part of our proof.

The following theorem follows from Proposition 7.3 and Lemma 7.5.

Theorem 7.6. Assume that $\epsilon \in (0, \frac{1}{\lambda} - 1)$. We can compute a $c(1 - \epsilon)/(1 + \epsilon)$-approximation for the unsaturated sender problem, where $c = 1$ for $K = 2$ and $c = 2/3$ for $K > 2$, in $O(n^2 K (n + K))$ time.

Since the computation time does not depend on $\epsilon$, by selecting a small $\epsilon$ we can attain in polynomial time an approximation factor close to 1 for $K = 2$ and 2/3 for $K > 2$. We summarize the policy that attains the above performance guarantee in Section 7.6.



Finally, the idea of using an approximation algorithm for the Lagrangean relaxation of an optimization 
problem, and performing a parametric search to satisfy the constraint while preserving the approximation 
ratio, has a rich history in approximation algorithms. It is the method of choice for network design prob- 
lems when there is a hard bound on the resource allocation constraint, for instance, A;-medians fTF] and 
$k$-MST [7]. We extend this technique to deal with the hard constraint on the rate of transmissions, and our results constitute the first application of this technique to policy design. This technique yields threshold-based reward policies, which suggests that it will have interesting connections to the retirement-based index policies [14] for multi-armed bandit problems; these connections will be explored in future work.



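7.2 Proof of Lemma 7.3

We first prove that $\{\beta\}$ constitutes a feasible solution for LPUNSAT($\epsilon$). Since $\mathcal{S}_{\sigma_+} < \lambda(1 + \epsilon) \le \mathcal{S}_{\sigma_-}$, $\beta_{\sigma_+} \in [0, 1)$ and $\beta_{\sigma_-} \in (0, 1]$. Finally, note that $\sum_{\sigma \in \Pi} \beta_\sigma = 1$ and $\sum_{\sigma \in \Pi} \beta_\sigma \mathcal{S}_\sigma = \lambda(1 + \epsilon)$. The result follows.

We now prove that $\sum_{\sigma \in \Pi} \beta_\sigma \mathcal{G}_\sigma \ge c(1 - \epsilon)\,\mathcal{G}^*(\epsilon)$. Note that since $\sigma_+$ $c$-approximates LPLAGRANGE($\epsilon$, $\mathcal{L}^+$) and $0 < c \le 1$ and $\sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma = 1$,

$\mathcal{G}_{\sigma_+} + \mathcal{L}^+ \left( \lambda(1 + \epsilon) - \mathcal{S}_{\sigma_+} \right) \ge c \left[ \sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma \mathcal{G}_\sigma + \mathcal{L}^+ \left( \lambda(1 + \epsilon) - \sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma \mathcal{S}_\sigma \right) \right]$ (3)

And likewise for $\mathcal{L}^-$,

$\mathcal{G}_{\sigma_-} + \mathcal{L}^- \left( \lambda(1 + \epsilon) - \mathcal{S}_{\sigma_-} \right) \ge c \left[ \sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma \mathcal{G}_\sigma + \mathcal{L}^- \left( \lambda(1 + \epsilon) - \sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma \mathcal{S}_\sigma \right) \right]$ (4)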



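Since $\{\beta^*(\epsilon)\}$ is a feasible solution of LPUNSAT($\epsilon$), $\sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma \mathcal{S}_\sigma = \lambda(1 + \epsilon)$. Thus the terms $\left( \lambda(1 + \epsilon) - \sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma \mathcal{S}_\sigma \right)$ can be removed from the respective right hand sides of equations (3) and (4). We now multiply equation (3) by $a$ and equation (4) by $1 - a$ and add the resulting equations. The right hand side of the sum evaluates to $c \sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma \mathcal{G}_\sigma = c\,\mathcal{G}^*(\epsilon)$. Since $a\,\mathcal{S}_{\sigma_+} + (1 - a)\,\mathcal{S}_{\sigma_-} = \lambda(1 + \epsilon)$, the left hand side becomes $a\,\mathcal{G}_{\sigma_+} + (1 - a)\,\mathcal{G}_{\sigma_-} - (\mathcal{L}^+ - \mathcal{L}^-)(1 - a)\left( \lambda(1 + \epsilon) - \mathcal{S}_{\sigma_-} \right)$. Thus, we have

$a\,\mathcal{G}_{\sigma_+} + (1 - a)\,\mathcal{G}_{\sigma_-} - (\mathcal{L}^+ - \mathcal{L}^-)(1 - a)\left( \lambda(1 + \epsilon) - \mathcal{S}_{\sigma_-} \right) \ge c\,\mathcal{G}^*(\epsilon)$

Now since $0 < \lambda(1 + \epsilon) \le \mathcal{S}_{\sigma_-} \le 1$, $-1 \le (1 - a)\left( \lambda(1 + \epsilon) - \mathcal{S}_{\sigma_-} \right) \le 0$. Thus, since $0 \le (\mathcal{L}^+ - \mathcal{L}^-) \le c\,\epsilon\,\mathcal{G}^*(\epsilon)$, we have

$a\,\mathcal{G}_{\sigma_+} + (1 - a)\,\mathcal{G}_{\sigma_-} \ge c\,\mathcal{G}^*(\epsilon) - (\mathcal{L}^+ - \mathcal{L}^-) \ge c(1 - \epsilon)\,\mathcal{G}^*(\epsilon)$

The result follows since $\beta_{\sigma_+} = a$ and $\beta_{\sigma_-} = 1 - a$.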

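7.3 Threshold policies for c-approximating LPLAGRANGE($\epsilon$, $\mathcal{L}$)

We first generalize the definition for $H_{u,\ell}$ as follows.

Definition 7.6. Let $H_{u,\ell,x} = \emptyset$ for all $u \ge K$. For each $\ell$, starting from $u = K - 1$, down to $u = w_{\ell,x}$, recursively define $H_{u,\ell,x} = \left\{ i : i \notin \bigcup_{v : v > u} H_{v,\ell,x}, \text{ and } \bar r_i[u] - \frac{c_i}{p_i[u]} > \max(\bar r_\ell[0], r_{u-1}, x) \right\} \setminus \{\ell\}$. Assume that $c_i/p_i[u] = \infty$ when $p_i[u] = 0$. Let $w_{\ell,x} = \min\{u : u > 0, r_u > \max(\bar r_\ell[0], x)\}$. If $r_u \le \max(\bar r_\ell[0], x)$ for all $u$, $w_{\ell,x} = K$.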

ReserveBkup($\ell$, $x$)

(Probing Process:)

$u \leftarrow K - 1$

While $u \ge w_{\ell,x}$ and the highest state of a probed channel is lower than $u$,
probe channels in $H_{u,\ell,x}$ in non-increasing order of $\bar r_j[u] - c_j/p_j[u]$.

$u \leftarrow u - 1$

(Selection Process:)

Consider the channel $j$ in the highest state $y$ among all probed channels.
(If no channel is probed, $r_y = -1$.)
If $\max(r_y, \bar r_\ell[0]) < x$, do not transmit. If $\max(r_y, \bar r_\ell[0]) \ge x$, transmit in $j$ if $r_y \ge \bar r_\ell[0]$ and transmit in $\ell$ otherwise.
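The selection process can be written as a small function. This is a hedged sketch: the function name, the dictionary representation of probe outcomes and the return labels are our own, and we assume rewards increase with the state so that the highest observed reward identifies the highest observed state.

```python
def select_action(probed_rewards, backup_reward, x):
    """Selection step of a threshold policy with backup.

    probed_rewards: dict mapping each probed channel to the reward r_y of
        the state it was observed in (empty if nothing was probed).
    backup_reward: expected success probability of the unprobed backup.
    x: transmission threshold chosen a priori.
    Returns ("idle", None), ("probed", channel) or ("backup", None).
    """
    if probed_rewards:
        j = max(probed_rewards, key=probed_rewards.get)
        r_y = probed_rewards[j]
    else:
        j, r_y = None, -1.0  # convention used when no channel is probed
    if max(r_y, backup_reward) < x:
        return ("idle", None)      # nothing clears the threshold
    if r_y >= backup_reward:
        return ("probed", j)       # best probed channel beats the backup
    return ("backup", None)

print(select_action({1: 0.75, 2: 0.25}, backup_reward=0.5, x=0.6))
print(select_action({2: 0.25}, backup_reward=0.5, x=0.4))
print(select_action({2: 0.25}, backup_reward=0.5, x=0.6))
```

With the threshold at or below the smallest reward, the function always transmits and so behaves like the selection step of a policy without a threshold.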

Note that ReserveBkup($\ell$, $x$) is similar to ReserveBkup($\ell$) except that ReserveBkup($\ell$, $x$) selects a transmission threshold, $x$, a priori, and transmits only if a probed channel is in state $j$ or a higher state such that $r_j \ge x$ or if the probability of success of the backup channel $\ell$ is not lower than $x$. Thus, ReserveBkup($\ell$, $x$) probes only those channels for which the expected rewards conditioned on being in states $k$ and above, where $r_k > \max(\bar r_\ell[0], x)$, exceed the probing cost.

Definition 7.7. Let BestReserveBkup($x$) be the ReserveBkup($\ell$, $x$) for that $\ell$ for which it attains the maximum gain among all $\ell \in \{0, \ldots, n\}$, and let $\sigma_x$ denote the decision tree of BestReserveBkup($x$).

In the next section, we prove that BestReserveBkup($x$) optimally solves $T(x)$ for $K = 2$ and 2/3-approximates $T(x)$ for $K > 2$. Lemma 7.4 follows since BestReserveBkup($x$) can be computed in $O(n^2 K)$ time.



7.4 Proof of Lemma 7.4

The proof relies on the following Generalized Structure Theorem, which is similar to the Structure Theorem 5.6.

Theorem 7.7 (Generalized Structure Theorem). Consider the Saturated Altered Reward ($x$) system described in Definition 7.5. There exists an optimum policy in this system that uses a unique backup channel whenever it transmits in a backup channel. The backup channel is used at the end of one $\le i$ path.

Proof. If no optimum policy transmits in a backup channel, the theorem clearly holds. Now, suppose there exists an optimum policy Opt1 that transmits in a backup channel at the end of some path in its decision tree. Observe that Lemma 5.5 holds in this system. The proof is the same as that in the original system. If Opt1 does not probe any channel, then the theorem follows as well. Let Opt1 probe a channel. Clearly, then, Opt1 probes a channel $j$ at the root node, say $m$, of its decision tree. Let $i$ be the highest state of $j$ for which Opt1 transmits in a backup channel somewhere downstream of $m$. Then, by Lemma 5.5, there exists another optimum policy Opt2 for which the decision tree rooted at $m$, and hence the overall decision tree, (a) is a $\le i$ tree, (b) uses a channel, say $\ell$, as a backup at the end of the $\le i$ path in the tree and (c) uses $\ell$ as a backup whenever it uses a backup. The theorem follows. □



Proof (of Lemma 7.4). Consider the Saturated Altered Reward ($x$) system described in Definition 7.5. The set of decision trees in this system is $\Pi$, irrespective of $x$. Note that the gain of any $\sigma \in \Pi$ in this system is $\mathcal{G}_\sigma - x\,\mathcal{S}_\sigma$ and depends on $x$. Let $\sigma^*$ be a policy that attains the maximum gain in this system (and hence solves problem $T(x)$), $F = \mathcal{G}_{\sigma_x} - x\,\mathcal{S}_{\sigma_x}$ and $BEST = \mathcal{G}_{\sigma^*} - x\,\mathcal{S}_{\sigma^*}$. We need to prove that $F = BEST$ for $K = 2$ and $F \ge (2/3) BEST$ for $K > 2$. Lemma 7.4 follows since BestReserveBkup($x$) can be computed in $O(n^2 K)$ time.

In this system, for each $x$, multiple $\sigma$ may maximize the gain. The Generalized Structure Theorem (Theorem 7.7) shows that for each $x$ at least one $\sigma^*$ uses a unique backup channel whenever a backup channel is used for transmission. We therefore consider a $\sigma^*$ that uses a unique backup, say $\ell$. Let $\sigma^*$ use $\ell$ as backup with probability $a$. Let $R = \sum_{i : r_i \ge x} p_{i\ell}(r_i - x)$ and $T = \sum_{i : r_i < x} p_{i\ell}(x - r_i)$.

Let $\mathcal{P}'(j)$ be the set of all decision trees in $\Pi$ that use no channel other than $j$ as a backup and do not probe $j$. Note that policies in $\mathcal{P}'(0)$ never transmit in any backup channel. Using similar arguments as in the proof of Theorem 5.2 (which holds for both $K = 2$ and $K > 2$), it can be shown that in this system ReserveBkup($j$, $x$) attains the maximum gain among all policies in $\mathcal{P}'(j)$. Thus, $F$ is the maximum gain attained in this system by any policy in $\bigcup_{j=0}^{n} \mathcal{P}'(j)$.

Let $K = 2$. We now prove that $\sigma^* \in \bigcup_{j=0}^{n} \mathcal{P}'(j)$. It follows that $F \ge BEST$. Let $x > r_1$. Now, $\sigma^*$ does not transmit in any channel. Thus, $\sigma^* \in \mathcal{P}'(0)$. The result follows. Now, let $x \le r_1$. Let $\sigma^* \notin \bigcup_{j=0}^{n} \mathcal{P}'(j)$. Given Theorem 7.7, the above happens only when $\sigma^*$ probes a channel $j$ in one path and uses it as a backup in another path. We now rule this out. Now, if a probed channel is in the highest state, state 1, $\sigma^*$ transmits in that channel. Thus, $\sigma^*$ consists of only one path, say $P$, and some other links that originate from $P$. Each of these links corresponds to the case that a probed channel is in state 1 and leads to a leaf node at which $\sigma^*$ transmits in the probed channel. Thus, $\sigma^*$ can transmit in a backup channel only at the end of $P$, but then it cannot have probed the channel in $P$, and hence does not probe the channel in any other path as well. The result follows.

Next, let $K > 2$. Now, construct a policy $\sigma_1$ that is similar to $\sigma^*$ except that whenever $\sigma^*$ uses $\ell$ as a backup, $\sigma_1$ does not transmit. Thus $\sigma_1$ attains a gain of at least $BEST - a(R - T)$. Also, $\sigma_1 \in \mathcal{P}'(0)$. Thus, its gain is upper bounded by $F$. Thus $F \ge BEST - a(R - T)$. Now, consider the policy that transmits in $\ell$ every slot without probing any channel. This policy is in $\mathcal{P}'(\ell)$ and attains a gain of $R - T$. Thus, $F \ge R - T$. It follows that

$(1 + a) F \ge BEST$. (5)

Next, construct another policy $\sigma_2$ that is similar to $\sigma^*$ except that $\sigma_2$ never probes $\ell$; instead, wherever $\sigma^*$ probes $\ell$, $\sigma_2$ follows the same course of actions as $\sigma^*$ does after discovering $\ell$ in state 0. The gain of $\sigma_2$ is at least $BEST - (1 - a)(R - c_\ell)$, because if $\ell$ was in state $i$ such that $r_i < x$, $\sigma^*$ will never transmit in $\ell$. Also, $\sigma_2 \in \mathcal{P}'(\ell)$. Therefore, $F \ge BEST - (1 - a)(R - c_\ell)$. Now consider another policy $\sigma_3$ that probes $\ell$ and subsequently transmits only if $\ell$ is in a state $i$ such that $r_i \ge x$; $\sigma_3$ neither probes nor transmits in any other channel. Clearly, $\sigma_3$ attains a gain of $R - c_\ell$. Also, $\sigma_3 \in \mathcal{P}'(0)$. Thus, $F \ge R - c_\ell$. Combining the last two equations,

$(2 - a) F \ge BEST$. (6)

Adding (5) and (6), we get $F \ge (2/3) BEST$. The result follows. □
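Adding (5) and (6) gives $3F \ge 2\,BEST$ regardless of $a$. The short check below, an illustration of ours rather than part of the proof, shows that $a = 1/2$ is the value at which the two bounds are jointly weakest, which is where the factor 2/3 comes from.

```python
def guaranteed_fraction(a):
    """Best lower bound on F / BEST implied by (5) and (6) for a given a.

    (5) gives F >= BEST / (1 + a) and (6) gives F >= BEST / (2 - a);
    the stronger of the two applies.
    """
    return max(1.0 / (1.0 + a), 1.0 / (2.0 - a))

# Scan the backup-usage probability a over [0, 1].
worst = min(guaranteed_fraction(k / 1000.0) for k in range(1001))
print(worst)
```

For any other value of $a$, one of the two inequalities alone already gives a bound strictly better than 2/3.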

7.5 Proof of Lemma 7.5

We now describe how the parameters $\mathcal{L}^+, \mathcal{L}^-$ and $\sigma_+, \sigma_-$ are selected.

Definition 7.8. Let THRESHOLD be an array consisting of $n + K + 2$ elements which are $-1, 2, r_0, \ldots, r_{K-1}$ and $\bar r_1[0], \ldots, \bar r_n[0]$ sorted in increasing order.

Note that the decision tree $\sigma_x$ of BestReserveBkup($x$) is the same for all $x \in$ (THRESHOLD[$i$], THRESHOLD[$i + 1$]). Thus, THRESHOLD is the collection of thresholds $x$ corresponding to different values of $\sigma_x$.

Definition 7.9. For $1 \le i \le n + K + 1$ let $\sigma_i$ be $\sigma_x$ (i.e., BestReserveBkup($x$)) for $x$ = THRESHOLD[$i$].

Lemma 7.8. For $\epsilon \in (0, 1/\lambda - 1)$, there exists an $i$ such that $\mathcal{S}_{\sigma_i} \ge \lambda(1 + \epsilon) > \mathcal{S}_{\sigma_{i+1}}$.

Proof. Now, $\mathcal{S}_{\sigma_x} = 1$ for $x < 0$ and $\mathcal{S}_{\sigma_x} = 0$ for $x > r_{K-1}$. Thus, since THRESHOLD[1] $= -1$ and THRESHOLD[$n + K + 2$] $> r_{K-1}$, $\mathcal{S}_{\sigma_1} = 1 \ge \lambda(1 + \epsilon)$ and $\mathcal{S}_{\sigma_{n+K+1}} = 0 < \lambda(1 + \epsilon)$. The result follows since $\mathcal{S}_{\sigma_i} \ge \mathcal{S}_{\sigma_{i+1}}$ for each $i$. □



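Proof (of Lemma 7.5). Let $i^*$ be the $i$ found by Lemma 7.8 and $\Delta$ = min( '^^91M. , (THRESHOLD[$i^* + 1$] $-$ THRESHOLD[$i^*$])/2 ).

If $\mathcal{S}_{\sigma_x} < \lambda(1 + \epsilon)$ for $x \in$ (THRESHOLD[$i^*$], THRESHOLD[$i^* + 1$]), $\mathcal{L}^+$ = THRESHOLD[$i^*$] $+ \Delta$ and $\mathcal{L}^-$ = THRESHOLD[$i^*$]. If $\mathcal{S}_{\sigma_x} \ge \lambda(1 + \epsilon)$ for $x \in$ (THRESHOLD[$i^*$], THRESHOLD[$i^* + 1$]), $\mathcal{L}^+$ = THRESHOLD[$i^* + 1$] and $\mathcal{L}^-$ = THRESHOLD[$i^* + 1$] $- \Delta$. Now, $\sigma_+ = \sigma_{\mathcal{L}^+}$ and $\sigma_- = \sigma_{\mathcal{L}^-}$.

Since $\mathcal{S}_{\sigma_{i^*}} \ge \lambda(1 + \epsilon) > \mathcal{S}_{\sigma_{i^*+1}}$, in both cases $\mathcal{L}^+, \mathcal{L}^-$ and $\sigma_+, \sigma_-$ satisfy the properties of Lemma 7.3.

Note that we need to compute $\sigma_x$ for $O(n + K)$ values of $x$. Thus, the guarantee on the computation time follows since $\sigma_x$ can be computed in $O(n^2 K)$ time (Lemma 7.4). □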



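7.6 Algorithm Summary

We now summarize the design of the stable policy UNSATAPPROX($\epsilon$) that attains a gain of $\frac{1-\epsilon}{1+\epsilon} G_U$ for $K = 2$ and $(2/3)\frac{1-\epsilon}{1+\epsilon} G_U$ for $K > 2$ (Theorem 7.6). Recall that $\sigma_x$ is BestReserveBkup($x$), and $\sigma_i$ is $\sigma_x$ for $x$ = THRESHOLD[$i$] (Definition 7.8).

UNSATAPPROX($\epsilon$)

1. Compute $\mathcal{L}^+, \mathcal{L}^-$ as in the proof of Lemma 7.5 and let $\sigma_+, \sigma_-$ denote BestReserveBkup($\mathcal{L}^+$) and BestReserveBkup($\mathcal{L}^-$) respectively.

2. In each busy slot, use $\sigma_+$ with probability $a$ and $\sigma_-$ with probability $1 - a$, where $a = \frac{\mathcal{S}_{\sigma_-} - \lambda(1+\epsilon)}{\mathcal{S}_{\sigma_-} - \mathcal{S}_{\sigma_+}}$.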
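The parametric search behind step 1 can be sketched as follows. This is only an outline under stated assumptions: `transmit_prob_at` is a hypothetical callable standing in for the transmission probability of BestReserveBkup($x$), which is not implemented here, indices are 0-based unlike the 1-based array in the text, and the toy stand-in below is invented purely to exercise the search.

```python
def threshold_array(state_rewards, backup_rewards):
    """Sorted candidate thresholds: -1, 2, r_0..r_{K-1}, rbar_1[0]..rbar_n[0]."""
    return sorted([-1.0, 2.0, *state_rewards, *backup_rewards])

def crossing_index(thresholds, transmit_prob_at, lam, eps):
    """Find i with S(thresholds[i]) >= lam*(1+eps) > S(thresholds[i+1])."""
    target = lam * (1.0 + eps)
    probs = [transmit_prob_at(x) for x in thresholds]
    for i in range(len(probs) - 1):
        if probs[i] >= target > probs[i + 1]:
            return i
    raise ValueError("no crossing found")

def toy_transmit_prob(x):
    """Hypothetical stand-in: always transmit below 0, never above 0.5."""
    if x < 0.0:
        return 1.0
    return 0.5 if x <= 0.5 else 0.0

thresholds = threshold_array(state_rewards=[0.0, 1.0], backup_rewards=[0.5])
print(thresholds)
print(crossing_index(thresholds, toy_transmit_prob, lam=0.5, eps=0.25))
```

Once the crossing index is known, the two multipliers are placed on either side of the crossing as in the proof of Lemma 7.5, and step 2 mixes the two resulting trees.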
8 Conclusions

We have presented a simple model for studying the information acquisition and exploitation trade-off at 
a single wireless node, when the multiple available channels are multi-state, and the channel distributions 
and information acquisition costs could be different. We presented a general solution framework based on 
exploiting the structure of the optimal policy, and by using Lagrangean relaxations to simplify the space of 
approximately optimal solutions. We believe these techniques will have wider applicability, in particular 
when we consider the multiple node scenario. 